• Intel's Software Defined Super Cores

    From John Savard@[email protected] to comp.arch on Mon Sep 15 23:54:12 2025
    From Newsgroup: comp.arch

    When I saw a post about a new way to do OoO, I had thought it might be
    talking about this:

https://www.techradar.com/pro/is-it-a-bird-is-it-a-plane-no-its-super-core-intels-latest-patent-revives-ancient-anti-hyperthreading-cpu-technique-in-attempt-to-boost-processor-performance-but-will-it-be-enough

    Basically, Intel proposes to boost single-thread performance by splitting programs into chunks that can be performed in parallel on different cores, where the cores are intimately connected in order to make this work.

    This is a sound idea, but one may not find enough opportunities to use it.

    Although it's called "inverse hyperthreading", this technique could be combined with SMT - put the chunks into different threads on the same
    core, rather than on different cores, and then one wouldn't need to add
    extra connections between cores to make it work.

    John Savard
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Savard@[email protected] to comp.arch on Tue Sep 16 00:03:51 2025
    From Newsgroup: comp.arch

    On Mon, 15 Sep 2025 23:54:12 +0000, John Savard wrote:

    Although it's called "inverse hyperthreading", this technique could be combined with SMT - put the chunks into different threads on the same
    core, rather than on different cores, and then one wouldn't need to add
    extra connections between cores to make it work.

    On further reflection, this may be equivalent to re-inventing out-of-order execution.

    John Savard
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@[email protected] to comp.arch on Mon Sep 15 17:19:36 2025
    From Newsgroup: comp.arch

    On 9/15/2025 4:54 PM, John Savard wrote:
    When I saw a post about a new way to do OoO, I had thought it might be talking about this:

https://www.techradar.com/pro/is-it-a-bird-is-it-a-plane-no-its-super-core-intels-latest-patent-revives-ancient-anti-hyperthreading-cpu-technique-in-attempt-to-boost-processor-performance-but-will-it-be-enough

    Basically, Intel proposes to boost single-thread performance by splitting programs into chunks that can be performed in parallel on different cores, where the cores are intimately connected in order to make this work.

    This is a sound idea, but one may not find enough opportunities to use it.

    Although it's called "inverse hyperthreading", this technique could be combined with SMT - put the chunks into different threads on the same
    core, rather than on different cores, and then one wouldn't need to add
    extra connections between cores to make it work.

    Two weeks ago, I saw this in Tom's Hardware.

    https://www.tomshardware.com/pc-components/cpus/intel-patents-software-defined-supercore-mimicking-ultra-wide-execution-using-multiple-cores

    But at this point, it is just a patent. While it *might* get included
    in a future product, it seems a long way away, if ever.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@[email protected] to comp.arch on Mon Sep 15 17:56:28 2025
    From Newsgroup: comp.arch

    On 9/15/2025 4:54 PM, John Savard wrote:
    When I saw a post about a new way to do OoO, I had thought it might be talking about this:

https://www.techradar.com/pro/is-it-a-bird-is-it-a-plane-no-its-super-core-intels-latest-patent-revives-ancient-anti-hyperthreading-cpu-technique-in-attempt-to-boost-processor-performance-but-will-it-be-enough

    Basically, Intel proposes to boost single-thread performance by splitting programs into chunks that can be performed in parallel on different cores, where the cores are intimately connected in order to make this work.

    We would have to somehow tell the system that the program only uses a
    single thread, right? Not exactly sure how the sync is going to work
    with regard to multi-threaded and/or multi process programs?

    A single threaded program runs, then it calls into a function that
    creates a thread. Humm...


    This is a sound idea, but one may not find enough opportunities to use it.

    Although it's called "inverse hyperthreading", this technique could be combined with SMT - put the chunks into different threads on the same
    core, rather than on different cores, and then one wouldn't need to add
    extra connections between cores to make it work.

    Can one get something kind of akin to it by a clever use of affinity
    masks? But, those are not 100% guaranteed?
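
FWIW, a minimal sketch of the affinity-mask route (assuming Linux and
glibc's pthread_setaffinity_np, which is a GNU extension): it only
constrains where a thread may run, so it is indeed not a hard guarantee,
and it gives none of the tight inter-core coupling the patent describes.

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pin the calling thread to a single core (Linux/glibc specific).
 * The scheduler may still reschedule the thread within the mask; this
 * is a placement hint, not the intimate core coupling Intel patents. */
static int pin_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof set, &set);
}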
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stefan Monnier@[email protected] to comp.arch on Tue Sep 16 10:13:35 2025
    From Newsgroup: comp.arch

    Basically, Intel proposes to boost single-thread performance by splitting programs into chunks that can be performed in parallel on different cores, where the cores are intimately connected in order to make this work.

    Sounds like [multiscalar processors](doi:multiscalar processor)


    Stefan
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stefan Monnier@[email protected] to comp.arch on Tue Sep 16 10:15:04 2025
    From Newsgroup: comp.arch

    Sounds like [multiscalar processors](doi:multiscalar processor)
    ^^^^^^^^^^^^^^^^^^^^^
    10.1145/223982.224451

[ I guess it can be useful to actually look at what one pastes before
    pressing "send", eh? ]


    Stefan
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@[email protected] to comp.arch on Tue Sep 16 15:10:09 2025
    From Newsgroup: comp.arch

    Stefan Monnier <[email protected]> schrieb:

[ I guess it can be useful to actually look at what one pastes before
    pressing "send", eh? ]

This is sooooo 2010's. Next, you'll be claiming it makes sense to
    think before writing, and where would we be then? Not in the age
    of modern social media, that's for sure.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Tue Sep 16 15:50:38 2025
    From Newsgroup: comp.arch


    John Savard <[email protected]d> posted:

    When I saw a post about a new way to do OoO, I had thought it might be talking about this:

https://www.techradar.com/pro/is-it-a-bird-is-it-a-plane-no-its-super-core-intels-latest-patent-revives-ancient-anti-hyperthreading-cpu-technique-in-attempt-to-boost-processor-performance-but-will-it-be-enough

    Basically, Intel proposes to boost single-thread performance by splitting programs into chunks that can be performed in parallel on different cores, where the cores are intimately connected in order to make this work.

    This is a sound idea, but one may not find enough opportunities to use it.

    Although it's called "inverse hyperthreading", this technique could be combined with SMT - put the chunks into different threads on the same
    core, rather than on different cores, and then one wouldn't need to add extra connections between cores to make it work.

    Andy Glew was working on stuff like this 10-15 years ago

    John Savard
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From George Neuner@[email protected] to comp.arch on Tue Sep 16 13:01:30 2025
    From Newsgroup: comp.arch

    On Tue, 16 Sep 2025 00:03:51 -0000 (UTC), John Savard <[email protected]d> wrote:

    On Mon, 15 Sep 2025 23:54:12 +0000, John Savard wrote:

    Although it's called "inverse hyperthreading", this technique could be
    combined with SMT - put the chunks into different threads on the same
    core, rather than on different cores, and then one wouldn't need to add
    extra connections between cores to make it work.

On further reflection, this may be equivalent to re-inventing out-of-order execution.

    John Savard

    Sounds more like dynamic micro-threading.

Over the years I've seen a handful of papers about compile-time
micro-threading: that is, the compiler itself identifies separable
dependency chains in serial code and rewrites them into deliberately
threaded code to be executed simultaneously.
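
As a toy illustration of what such a compiler is after (my own sketch,
not from any of those papers): the serial loop in the comment contains
two independent dependency chains, and the rewrite rehosts one of them
onto a second thread.

#include <pthread.h>

/* Toy illustration (mine): the serial loop
 *     for (i = 0; i < n; i++) { sa += a[i]; sb += b[i]; }
 * has two independent dependency chains, so a micro-threading compiler
 * could in principle rewrite it as below. */
struct chain { const long *v; long n; long sum; };

static void *run_chain(void *p)
{
    struct chain *c = p;
    for (long i = 0; i < c->n; i++)
        c->sum += c->v[i];              /* one dependency chain */
    return NULL;
}

long sum_both(const long *a, const long *b, long n)
{
    struct chain ca = { a, n, 0 }, cb = { b, n, 0 };
    pthread_t t;

    pthread_create(&t, NULL, run_chain, &cb);   /* rehosted chain */
    run_chain(&ca);                             /* chain kept local */
    pthread_join(t, NULL);
    return ca.sum + cb.sum;
}

Whether the thread-creation and synchronization overhead ever pays off
for chains of realistic length is exactly the problem you describe.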

    It is not easy to do under the best of circumstances and I've never
    seen anything about doing it dynamically at run time.

    To make a thread worth rehosting to another core, it would need to be
    (at least) many 10s of instructions in length. To figure this out
    dynamically at run time, it seems like you'd need the decode window to
    be 1000s of instructions and a LOT of "figure-it-out" circuitry.


    MMV, but to me it doesn't seem worth the effort.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@[email protected] to comp.arch on Wed Sep 17 11:54:09 2025
    From Newsgroup: comp.arch

    MitchAlsup wrote:

    John Savard <[email protected]d> posted:

    When I saw a post about a new way to do OoO, I had thought it might be
    talking about this:

https://www.techradar.com/pro/is-it-a-bird-is-it-a-plane-no-its-super-core-intels-latest-patent-revives-ancient-anti-hyperthreading-cpu-technique-in-attempt-to-boost-processor-performance-but-will-it-be-enough

    Basically, Intel proposes to boost single-thread performance by splitting
programs into chunks that can be performed in parallel on different cores, where the cores are intimately connected in order to make this work.

This is a sound idea, but one may not find enough opportunities to use it.
    Although it's called "inverse hyperthreading", this technique could be
    combined with SMT - put the chunks into different threads on the same
    core, rather than on different cores, and then one wouldn't need to add
    extra connections between cores to make it work.

    Andy Glew was working on stuff like this 10-15 years ago

That's what immediately came to my mind as well; it looks a lot like
trying some of his ideas about scouting micro-threads, doing work in the
hope that it will turn out to be useful.

    To me it sounds like it is related to eager execution, except skipping
    further forward into upcoming code.

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@[email protected] to comp.arch on Wed Sep 17 14:34:09 2025
    From Newsgroup: comp.arch

    On Wed, 17 Sep 2025 11:54:09 +0200
    Terje Mathisen <[email protected]> wrote:

    MitchAlsup wrote:

    John Savard <[email protected]d> posted:

    When I saw a post about a new way to do OoO, I had thought it
    might be talking about this:

https://www.techradar.com/pro/is-it-a-bird-is-it-a-plane-no-its-super-core-intels-latest-patent-revives-ancient-anti-hyperthreading-cpu-technique-in-attempt-to-boost-processor-performance-but-will-it-be-enough

    Basically, Intel proposes to boost single-thread performance by
    splitting programs into chunks that can be performed in parallel
    on different cores, where the cores are intimately connected in
    order to make this work.

    This is a sound idea, but one may not find enough opportunities to
    use it.

    Although it's called "inverse hyperthreading", this technique
    could be combined with SMT - put the chunks into different threads
    on the same core, rather than on different cores, and then one
    wouldn't need to add extra connections between cores to make it
    work.

    Andy Glew was working on stuff like this 10-15 years ago

    That's what immediately fell to my mind as well, it looks a lot like
    trying some of his ideas about scouting micro-threads, doing work in
    the hope that it will turn out useful.

    To me it sounds like it is related to eager execution, except
    skipping further forward into upcoming code.

    Terje



The question is what the fact of patenting most likely means.
IMHO, it means that they explored the idea and decided against going in
this particular direction in the near- to medium-term future.

I think that when Intel actually plans to use a particular idea, they
keep the idea secret for as long as they can and either don't patent it
at all or apply for a patent after the release of the product.
I could be wrong about that.

On the other hand,
some of the people named on the patent appear to be leading figures
in Intel's P-core teams. Some of them gave presentations a year ago
about the advantages of removing SMT. Removing SMT and this super-core
idea can be considered complementary - both push in the direction of
cores with a smaller number of EU pipes. So maybe the idea was seriously
considered for Intel products in the mid-term future.
Anyway, a couple of months ago Tan himself said that Intel is reversing
the decision to remove SMT, which probably means that all their mid-term
plans are undergoing significant changes.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@[email protected] (Anton Ertl) to comp.arch on Wed Sep 17 13:46:33 2025
    From Newsgroup: comp.arch

    Michael S <[email protected]> writes:
    The question is what is most likely meaning of the fact of patenting?
    IMHO, it means that they explored the idea and decided against going in
    this particular direction in the near and medium-term future.

    I think that when Intel actually plans to use particular idea then they
    keep the idea secret for as long as they can and either don't patent at
    all or apply for patent after release of the product.
    I can be wrong about it.

    That would risk that somebody without patent exchange agreements with
    Intel patents the invention first (whether independently developed or
    due to a leak). Advantages of such a strategy: Companies with patent
    exchange agreements learn even later about the invention, and the
    patent expires at a later date.

I remember an article about alias prediction (IIRC for executing
stores before architecturally earlier loads), where the author read a
patent from Intel, did some measurements on a released Intel CPU,
and confirmed that they actually implemented what the patent
described.

If you find that article and compare the date when the patent was
submitted with the date of the release of the processor, you can check
your theory.

    Some of them 1 year ago gave representations
    about advantages of removal of SMT.

    I did not read any accounts of that that appeared particularly
    knowledgeable. What are the advantages, or where can I read about
    these presentations?

    Removal of SMT and this super-core
    idea can be considered complimentary - both push into direction of
    cores with smaller # of EU pipes.

    What do you mean by that? Narrower cores? In recent years cores seem
    to have exploded in width. From 1995 up to and including 2018 Intel
    produced 3-wide and 4-wide designs (with 4-wide coming IIRC with Sandy
    Bridge in 2011), and since then even the Skymont E-core has grown to
    8-wide, with 26 execution ports and 16-wide retirement. And other CPU manufacturers have also increased the widths of their CPUs.

    It seems that there has been a breakthrough in extracting ILP, making
    wider cores pay off better, a breakthrough in designing wider register
    renamers and making other structures wider, or both.

Pushing for narrower cores appears implausible to me at this stage.

Concerning the removal of SMT, I can only guess, but that did not
appear implausible to me with Intel's hybrid CPUs: They have P-cores
    for fast single-thread performance, and lots of E-cores for
    multi-thread performance. You allocate threads that need
    single-thread performance to P-cores and threads that don't to
    E-cores. If you have even more tasks, i.e., a heavily multi-threaded
    load, do you want to slow down the threads that run on the P-cores by
    switching them to SMT mode, also increasing the already-high power
    consumption of the P-cores, lowering the clock of everything to stay
    within the power limit, and thus possibly the performance? If not,
    you don't need SMT.

    Still, after touting the SMT horn for so long, I don't expect that
    such considerations are the only ones. There must be a significant
    advantage in design complexity or die area when leaving it away
    (contradicting the earlier claim that SMT costs very little).

Concerning super cores, whatever they are, my guess is that the idea is
    to try to extract even more performance from (as far as software is
    concerned) single-threaded programs than achievable with the wide
    cores of today.

    Anyway, couple of months ago Tan himself said that Intel is reversing
    the decision to remove SMT.

    On the servers, they do not follow the hybrid strategy, for whatever
    reason, so the thoughts above don't apply there. And maybe they found
    that the cloud providers want SMT, in order to sell their customers
    twice as many "CPUs".

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@[email protected] to comp.arch on Wed Sep 17 13:07:49 2025
    From Newsgroup: comp.arch

    On 9/15/2025 6:54 PM, John Savard wrote:
    When I saw a post about a new way to do OoO, I had thought it might be talking about this:

https://www.techradar.com/pro/is-it-a-bird-is-it-a-plane-no-its-super-core-intels-latest-patent-revives-ancient-anti-hyperthreading-cpu-technique-in-attempt-to-boost-processor-performance-but-will-it-be-enough

    Basically, Intel proposes to boost single-thread performance by splitting programs into chunks that can be performed in parallel on different cores, where the cores are intimately connected in order to make this work.

    This is a sound idea, but one may not find enough opportunities to use it.

    Although it's called "inverse hyperthreading", this technique could be combined with SMT - put the chunks into different threads on the same
    core, rather than on different cores, and then one wouldn't need to add
    extra connections between cores to make it work.


    Still sometimes it seems like it is only a matter of time until Intel or
    AMD releases a new CPU that just sort of jettisons x86 entirely at the hardware level, but then pretends to still be an x86 chip by running *everything* in a firmware level emulator via dynamic translation.

    Say, more cores and less power use, at the possible expense of some
    amount of performance.

    ...


    John Savard

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@[email protected] to comp.arch on Wed Sep 17 18:53:24 2025
    From Newsgroup: comp.arch

    BGB <[email protected]> schrieb:

    Still sometimes it seems like it is only a matter of time until Intel or
    AMD releases a new CPU that just sort of jettisons x86 entirely at the hardware level, but then pretends to still be an x86 chip by running *everything* in a firmware level emulator via dynamic translation.

For AMD, that happened already a few decades ago; they translate
    x86 code into RISC-like microops.

    See "The Anatomy of a High-Performance Microprocessor: A Systems
    Perspective" by Bruce Shriver and Bennett Smith.

    For a later perspective, see

    https://github.com/google/security-research/blob/master/pocs/cpus/entrysign/zentool/docs/reference.md
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Wed Sep 17 18:54:01 2025
    From Newsgroup: comp.arch


    BGB <[email protected]> posted:

    On 9/15/2025 6:54 PM, John Savard wrote:
    When I saw a post about a new way to do OoO, I had thought it might be talking about this:

https://www.techradar.com/pro/is-it-a-bird-is-it-a-plane-no-its-super-core-intels-latest-patent-revives-ancient-anti-hyperthreading-cpu-technique-in-attempt-to-boost-processor-performance-but-will-it-be-enough

    Basically, Intel proposes to boost single-thread performance by splitting programs into chunks that can be performed in parallel on different cores, where the cores are intimately connected in order to make this work.

    This is a sound idea, but one may not find enough opportunities to use it.

    Although it's called "inverse hyperthreading", this technique could be combined with SMT - put the chunks into different threads on the same
    core, rather than on different cores, and then one wouldn't need to add extra connections between cores to make it work.


    Still sometimes it seems like it is only a matter of time until Intel or
    AMD releases a new CPU that just sort of jettisons x86 entirely at the hardware level, but then pretends to still be an x86 chip by running *everything* in a firmware level emulator via dynamic translation.

    Transmeta tried and failed to do this.

    Say, more cores and less power use, at the possible expense of some
    amount of performance.

    ...


    John Savard

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@[email protected] to comp.arch on Wed Sep 17 23:00:15 2025
    From Newsgroup: comp.arch

    On Wed, 17 Sep 2025 18:53:24 -0000 (UTC)
    Thomas Koenig <[email protected]> wrote:

    BGB <[email protected]> schrieb:

    Still sometimes it seems like it is only a matter of time until
    Intel or AMD releases a new CPU that just sort of jettisons x86
    entirely at the hardware level, but then pretends to still be an
    x86 chip by running *everything* in a firmware level emulator via
    dynamic translation.

    For AMD, that has happend already a few decades ago; they translate
    x86 code into RISC-like microops.


    Not really.

    First, translation on the fly does not count.

Second, even for translation on the fly, only the ancient K6 worked that
way. Their later chips did a lot of work at the level of macro-ops,
which in the majority of cases have a one-to-one correspondence to the
original x86 load-op and load-op-store instructions.

    Actually, I am not 100% sure about Bulldozer and derivatives, but K7,
    K8 and all generations of Zen are using macro-ops.

    See "The Anatomy of a High-Performance Microprocessor: A Systems
    Perspective" by Bruce Shriver and Bennett Smith.


    Badly outdated text.

    For a later perspective, see

    https://github.com/google/security-research/blob/master/pocs/cpus/entrysign/zentool/docs/reference.md




    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@[email protected] (Scott Lurndal) to comp.arch on Wed Sep 17 20:19:14 2025
    From Newsgroup: comp.arch

    BGB <[email protected]> writes:
    On 9/15/2025 6:54 PM, John Savard wrote:
    When I saw a post about a new way to do OoO, I had thought it might be
    talking about this:

https://www.techradar.com/pro/is-it-a-bird-is-it-a-plane-no-its-super-core-intels-latest-patent-revives-ancient-anti-hyperthreading-cpu-technique-in-attempt-to-boost-processor-performance-but-will-it-be-enough

    Basically, Intel proposes to boost single-thread performance by splitting
programs into chunks that can be performed in parallel on different cores, where the cores are intimately connected in order to make this work.

This is a sound idea, but one may not find enough opportunities to use it.
    Although it's called "inverse hyperthreading", this technique could be
    combined with SMT - put the chunks into different threads on the same
    core, rather than on different cores, and then one wouldn't need to add
    extra connections between cores to make it work.


    Still sometimes it seems like it is only a matter of time until Intel or
AMD releases a new CPU that just sort of jettisons x86 entirely at the hardware level, but then pretends to still be an x86 chip by running *everything* in a firmware level emulator via dynamic translation.

    That was tried three decades ago. https://en.wikipedia.org/wiki/Transmeta


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Levine@[email protected] to comp.arch on Wed Sep 17 21:33:17 2025
    From Newsgroup: comp.arch

    According to BGB <[email protected]>:
    Still sometimes it seems like it is only a matter of time until Intel or
AMD releases a new CPU that just sort of jettisons x86 entirely at the hardware level, but then pretends to still be an x86 chip by running *everything* in a firmware level emulator via dynamic translation.

    That sounds a whole lot like what Transmeta did 25 years ago:

    https://en.wikipedia.org/wiki/Transmeta_Crusoe

    They failed but perhaps things are different now. Their
    native architecture was VLIW which might have been part
    of the problem.
    --
    Regards,
    John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@[email protected] (Anton Ertl) to comp.arch on Thu Sep 18 05:27:15 2025
    From Newsgroup: comp.arch

    BGB <[email protected]> writes:
    Still sometimes it seems like it is only a matter of time until Intel or
AMD releases a new CPU that just sort of jettisons x86 entirely at the hardware level, but then pretends to still be an x86 chip by running *everything* in a firmware level emulator via dynamic translation.

    Intel has already done so, although AFAIK not at the firmware level:
    Every IA-64 CPU starting with the Itanium II did not implement IA-32
    in hardware (unlike the Itanium), but instead used dynamic translation.

    There is no reason for Intel to repeat this mistake, or for anyone
    else to go there, either.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@[email protected] (Anton Ertl) to comp.arch on Thu Sep 18 05:31:29 2025
    From Newsgroup: comp.arch

    Thomas Koenig <[email protected]> writes:
    BGB <[email protected]> schrieb:

    Still sometimes it seems like it is only a matter of time until Intel or
    AMD releases a new CPU that just sort of jettisons x86 entirely at the
    hardware level, but then pretends to still be an x86 chip by running
    *everything* in a firmware level emulator via dynamic translation.

    For AMD, that has happend already a few decades ago; they translate
    x86 code into RISC-like microops.

That's nonsense; regulars of this group should know better; at least
    this nonsense has been corrected often enough. E.g., I wrote in <[email protected]>:

    |Not even if the microcode the Intel and AMD chips used was really
    |RISC-like, which it was not (IIRC the P6 uses micro-instructions with
|around 100bits, and the K7 has a read-write Rop (with the "R" of "Rop"
|standing for "RISC").

    Another difference is that the OoO engine that sees the uOps performs
    only a very small part of the functionality of branches, with the
    majority performed by the front end. I.e., there is no branching in
the OoO engine that sees the uOps; at most it confirms the branch prediction or diagnoses a misprediction, at which point the OoO
    engine is out of a job and has to wait for the front end; possibly
    only the ROB (which deals with instructions again) resolves the
    misprediction and kicks the front end into action, however.

    As Mitch Alsup has written, AMD has its MacroOps (load-op and RMW) in
    addition to the Rops. It's not entirely clear which parts of the
    engine see MacroOps and ROPs, but my impression was that the MacroOps
    are not split into ROPs for the largest part of the OoO engine.

    See "The Anatomy of a High-Performance Microprocessor: A Systems
    Perspective" by Bruce Shriver and Bennett Smith.

    From 1998. Unfortunately, there are not many more recent books about
    the microarchitecture of OoO CPUs. What I have found:

    Modern Processor Design: Fundamentals of Superscalar Processors
    John Paul Shen, Mikko H. Lipasti
    McGraw-Hill
    656 pages
published 2004 or so (don't let the 2013 date from the reprint fool you)
Discusses CPU design (not just OoO) using various real CPUs from the
1990s as examples.

    Processor Microarchitecture -- An Implementation Perspective
Antonio Gonzalez, Fernando Latorre, Grigorios Magklis
    Springer
    published 2010
    Relatively short, discusses the various parts of an OoO CPU and how to implement them.

    Henry Wong
    A Superscalar Out-of-Order x86 Soft Processor for FPGA
    Ph.D. thesis, U. Toronto
    https://www.stuffedcow.net/files/henry-thesis-phd.pdf
Slides: https://web.stanford.edu/class/ee380/Abstracts/190605-slides.pdf
published 2017

    A problem is that the older books don't cover recent developments such
    as alias prediction and that Wong was limited by what a single person
    can do (his work was not part of a larger research project at
    U. Toronto), as well as what fits into an FPGA.

    BTW, Wong's work can be seen as a refutation of BGB's statement: He
    chose to implement IA-32; on slide 14 of <https://web.stanford.edu/class/ee380/Abstracts/190605-slides.pdf> he
    states "It’s easy to implement!".

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@[email protected] (Anton Ertl) to comp.arch on Thu Sep 18 06:14:30 2025
    From Newsgroup: comp.arch

John Levine <[email protected]> writes:
https://en.wikipedia.org/wiki/Transmeta_Crusoe

    They failed but perhaps things are different now. Their
    native architecture was VLIW which might have been part
    of the problem.

It definitely was. However, even on modern high-performance OoO cores
like Apple's M1-M4 P-cores or Qualcomm's Oryon, the performance of
dynamically translated AMD64 code is usually lower than on comparable
CPUs from Intel and AMD.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@[email protected] to comp.arch on Thu Sep 18 03:39:57 2025
    From Newsgroup: comp.arch

    On 9/17/2025 4:33 PM, John Levine wrote:
    According to BGB <[email protected]>:
    Still sometimes it seems like it is only a matter of time until Intel or
    AMD releases a new CPU that just sort of jettisons x86 entirely at the
    hardware level, but then pretends to still be an x86 chip by running
    *everything* in a firmware level emulator via dynamic translation.

    That sounds a whole lot like what Transmeta did 25 years ago:

    https://en.wikipedia.org/wiki/Transmeta_Crusoe

    They failed but perhaps things are different now. Their
    native architecture was VLIW which might have been part
    of the problem.


    Might be different now:
    25 years ago, Moore's law was still going strong, and the general
    concern was more about maximizing scalar performance rather than energy efficiency or core count (and, in those days, processors were generally single-core).


    Now we have a different situation:
    Moore's law is dying off;
    Scalar CPU performance has hit a plateau;
    And, for many uses, performance is "good enough";
    A lot more software can make use of multi-threading;
    ...


Likewise, x86 tends to need a lot of the "big CPU" stuff to perform
well, whereas something like a RISC-style ISA can get better performance
on a comparatively smaller and cheaper core, and with a somewhat better
"performance per watt" metric.


    So, one possibility could be, rather than a small number of big/fast
    cores (either VLIW or OoO), possibly a larger number of smaller cores.

    The cores could maybe be LIW or in-order RISC.




    One possibility could be that virtual processors don't run on a single
    core, say:
    The logical cores exist more as VMs each running a virtual x86 processor
    core;
    The dynamic translation doesn't JIT translate to a linear program.

    Say:
    Breaks code into traces;
    Each trace uses something akin to CSP mixed with Pi-Calculus;
    Address translation is explicit in the ISA, with specialized ISA level memory-ordering and control-flow primitives.

    For example, there could be special ISA level mechanisms for submitting
    a job to a local job-queue, and pulling a job from the queue.
    Memory accesses could use a special "perform a memory access or branch-subroutine" instruction ("MEMorBSR"), where the MEMorBSR
    operations will try to access memory, either continuing to the next instruction (success) or Branching-to-Subroutine (access failed).

    Where the failure cases could include (but not limited to) TLB miss;
    access fault; memory ordering fault; ...

    The "memory ordering fault" case could be, when traces are submitted to
    the queue, if they access memory, they are assigned sequence numbers
    based on Load and Store operations. When memory is accessed, the memory
blocks in the cache could be marked with sequence numbers when read or modified. On access, it could detect if/when memory accesses have
    out-of-order sequence numbers, and then fall back to special-case
    handling to restore the intended order (reverting any "uncommitted"
    writes, and putting the offending blocks back into the queue to be
    re-run after the preceding blocks have finished).
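
A rough pseudo-C sketch of that ordering check (all names invented here,
purely to illustrate the scheme described above, not an actual design):

#include <stdbool.h>
#include <stdint.h>

/* Per-cache-line metadata recording the sequence number of the last
 * trace that read or wrote the line.  A trace that conflicts with a
 * logically later trace has read or produced stale data and must be
 * rolled back and re-queued, as described above. */
struct line_meta {
    uint64_t last_read_seq;
    uint64_t last_write_seq;
};

static bool access_in_order(struct line_meta *m, uint64_t trace_seq,
                            bool is_store)
{
    if (m->last_write_seq > trace_seq)              /* later trace already wrote */
        return false;
    if (is_store && m->last_read_seq > trace_seq)   /* later trace read stale data */
        return false;

    if (is_store)
        m->last_write_seq = trace_seq;
    else if (trace_seq > m->last_read_seq)
        m->last_read_seq = trace_seq;
    return true;   /* false => revert uncommitted writes, re-queue trace */
}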

    Possibly, the caches wouldn't directly commit stores to memory, but
    instead could keep track of a group of cache lines as an "in-flight" transaction. In this case, it could be possible for a "logically older"
    block to see the memory as it was before a more recent transaction, but
    an out-of-order write could be detected via sequence numbers (if seen,
    it would mean a "future" block had run but had essentially read stale data).

    Once a block is fully committed (after all preceding blocks are
    finished) its contents can be written back out to main RAM.
    Could be held in an area of RAM local to the group of cores running the logical core.

    Possibly, such a core might actually operate in multiple address spaces:
    Virtual Memory, via the transaction oriented MEMorBSR mechanism;
    There would likely be an explicit TLB here.
    So, TLB Miss handling could be essentially a runtime call.
    Local Memory:
    Physical Address, small non-externally-visible SRAM;
    Divided into Core-Local and Group-Shared areas;
    Physical Memory:
    External DRAM or similar;
    Resembles more traditional RAM access (via Load/Store Ops);
    Could be used for VM tasks and page-table walks.


    Would likely require significant hardware level support for things like job-queues and synchronization mechanisms.

    One possibility could be that some devices could exist local to a group
    of cores, which then have a synchronous "first come, first serve" access pattern (possibly similar to how my existing core design manages MMIO).

    Possibly it could work by passing fixed-size messages over a bus, with
    each request/response pair to a device being synchronous.


Possibly the JIT could try to infer possible memory aliasing between
traces, and enforce sequential ordering if aliasing is likely. This is
because performing the operations in the correct order the first time is
likely to be cheaper than detecting an ordering violation and rolling
back a transaction.

    Whereas proving that traces can't alias is likely to be a much harder
    problem than inferring a probable absence of aliasing. If no order
    violations occur during operation, it can be safely assumed that no
    memory aliasing happened.

    Maintaining transactions would complicate the cache design though, since
    now there is a problem that the cache line can't be written back or
    evicted until its write-associated sequence is fully committed.

    Might also need to be separate queue spots for "tasks currently being
    worked on" vs "to be done after the current jobs are done". Say, for
    example, if a job needs to be rolled-back and re-run, it would still
    need to come before jobs that are further in the future relative to itself.

    Unlike memory, register ordering is easier to infer statically, at least
    in the absence of dynamic branching.

    Might need to enforce ordering in cases where:
    Dynamic branch occurs and the path can't be followed statically;
    A following trace would depend on a register modified in a preceding trace;
    ...



    As for how viable any of this is, I don't know...

    The VM could be a lot simpler if one assumes a single threaded VM.


    Also unclear is if an ISA could be designed in a way to keep overheads
    low enough (would be a waste if the multi-threaded VM is slower than a
    single threaded VM would have been). But, this would require a lot of
    exotic mechanisms, so dunno...

    ...


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@[email protected] to comp.arch on Thu Sep 18 03:58:16 2025
    From Newsgroup: comp.arch

    On 9/18/2025 1:14 AM, Anton Ertl wrote:
    John Levine <[email protected]> writes:
    https://en.wikipedia.org/wiki/Transmeta_Crusoe

    They failed but perhaps things are different now. Their
    native architecture was VLIW which might have been part
    of the problem.

    It definitely was. However, even a modern high-performance OoO cores
    like Apple M1-M4's P-cores or on Qualcomm's Oryon, the performance of dynamically-translated AMD64 code is usually slower than on comparable
    CPUs from Intel and AMD.


    But, AFAIK the ARM cores tend to use significantly less power when
    emulating x86 than a typical Intel or AMD CPU, even if slower.

    Like, most of the ARM chips don't exactly have like a 150W TDP or similar...

    Like, if an ARM chip uses 1/30th the power, unless it is more than 30x
    slower, it may still win in Perf/W and similar...

    Then there is also Perf/$, and if such a CPU can win in both Perf/W and Perf/$, then it can still win even if it is slower, by throwing more
    cores at the problem.
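
Spelling out that arithmetic (the 1/30 power ratio and the 10x slowdown
below are made-up illustrative numbers, not measurements):

#include <stdio.h>

/* Throwaway arithmetic for the point above: with 1/30th the power, an
 * emulating chip wins on perf/W as long as it is less than 30x slower. */
int main(void)
{
    double power_ratio = 1.0 / 30.0; /* emulating chip power / native power */
    double slowdown    = 10.0;       /* emulating chip is 10x slower (toy value) */

    double perf_per_watt_gain = (1.0 / slowdown) / power_ratio;
    printf("perf/W relative to native: %.1fx\n", perf_per_watt_gain);
    /* prints 3.0x - still ahead on perf/W, since 10 < 30 */
    return 0;
}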


    Though, the possibly interesting idea could be trying for a
    multi-threaded translation rather than a single threaded translation.
    But, to have any hope, a multi-threaded translation is likely to need
    exotic ISA features; whereas a single threaded VM could probably run
    mostly OK on normal ARM or RISC-V or similar (well, assuming a world
where RISC-V addresses some more of its weak areas; but then again, with recent proposals for indexed load/store and auto-increment popping up,
    this is starting to look more likely...).


    - anton

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@[email protected] to comp.arch on Thu Sep 18 17:51:36 2025
    From Newsgroup: comp.arch

    On Thu, 18 Sep 2025 05:27:15 GMT
    [email protected] (Anton Ertl) wrote:

    BGB <[email protected]> writes:
    Still sometimes it seems like it is only a matter of time until
    Intel or AMD releases a new CPU that just sort of jettisons x86
    entirely at the hardware level, but then pretends to still be an x86
    chip by running *everything* in a firmware level emulator via
    dynamic translation.

    Intel has already done so, although AFAIK not at the firmware level:
    Every IA-64 CPU starting with the Itanium II did not implement IA-32
    in hardware (unlike the Itanium), but instead used dynamic
    translation.


    That's imprecise.
The first couple of generations of Itanium 2 (McKinley, Madison) still
had IA-32 hardware. It was gone in Montecito (2006).
Dynamic translation of application code was indeed available much
earlier, but early removal of the [crappy] hardware solution was
probably considered too risky.



    There is no reason for Intel to repeat this mistake, or for anyone
    else to go there, either.

    - anton

As said by just about everybody, BGB's proposal is most similar
to Transmeta. What was not said by everybody is that a similar approach
was tried for Arm, by NVidia no less.
https://en.wikipedia.org/wiki/Project_Denver



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Thu Sep 18 16:16:54 2025
    From Newsgroup: comp.arch


    Thomas Koenig <[email protected]> posted:

    BGB <[email protected]> schrieb:

    Still sometimes it seems like it is only a matter of time until Intel or AMD releases a new CPU that just sort of jettisons x86 entirely at the hardware level, but then pretends to still be an x86 chip by running *everything* in a firmware level emulator via dynamic translation.

    For AMD, that has happend already a few decades ago; they translate
    x86 code into RISC-like microops.

    With a very loose definition of RISC::

a) Does a RISC ISA contain memory reference address generation from
    the pattern [Rbase+Rindex<<scale+Displacement] ??
    Some will argue yes, others no.

    b) does a RISC ISA contain memory reference instructions that are
    combined with arithmetic calculations ??
    Some will argue yes, others no.

    c) does a RISC ISA contain memory reference instructions that
    access memory twice ??? LD-OP-ST :: but the TLB only once ?!?
    Most would argue no.

    Yet, this is the µISA of K7 and K8. It is only RISC in the very
    loosest sense of the word.
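
For point (a), a concrete illustration (my example, nothing from the
K7/K8 documentation): the C access below is a single scaled-index load
on x86-64, while a classic load/store RISC spells the address
arithmetic out.

/* Illustrates the [Rbase + Rindex<<scale + Displacement] pattern of
 * point (a). */
long get_elem(const long *base, long idx)
{
    return base[idx + 2];   /* effective address = base + idx*8 + 16 */
}
/* x86-64 (one instruction):   mov rax, [rdi + rsi*8 + 16]
 * RV64I load/store style:     slli t0, a1, 3
 *                             add  t0, a0, t0
 *                             ld   a0, 16(t0)
 */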

    And do not get me started on the trap/exception/interrupt model.


    See "The Anatomy of a High-Performance Microprocessor: A Systems
    Perspective" by Bruce Shriver and Bennett Smith.

    For a later perspective, see

    https://github.com/google/security-research/blob/master/pocs/cpus/entrysign/zentool/docs/reference.md


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@[email protected] to comp.arch on Thu Sep 18 12:33:44 2025
    From Newsgroup: comp.arch

    Anton Ertl wrote:
    Thomas Koenig <[email protected]> writes:
    BGB <[email protected]> schrieb:

Still sometimes it seems like it is only a matter of time until Intel or AMD releases a new CPU that just sort of jettisons x86 entirely at the
    hardware level, but then pretends to still be an x86 chip by running
    *everything* in a firmware level emulator via dynamic translation.
    For AMD, that has happend already a few decades ago; they translate
    x86 code into RISC-like microops.

    That's nonsense; regulars of this groups should know better, at least
    this nonsense has been corrected often enough. E.g., I wrote in <[email protected]>:

|Not even if the microcode the Intel and AMD chips used was really
|RISC-like, which it was not (IIRC the P6 uses micro-instructions with
|around 100bits, and the K7 has a read-write Rop (with the "R" of "Rop"
|standing for "RISC").

    I don't know what you are objecting to - Intel calls its internal
    instructions micro-operations or uOps, and AMD calls its Rops.
    The term is widely used to mean something that executes internally.
Beyond that it depends on the specifics of each micro-architecture.

    The number of bits has nothing to do with what it is called.
    If this uOp was for a ROB style design where all the knowledge about
    each instruction including register ids, immediate data,
    scheduling info, result data, status, is stored in a single ROB entry,
    then 100 bits sounds pretty small so I'm guessing that was a 32-bit cpu.

    Another difference is that the OoO engine that sees the uOps performs
    only a very small part of the functionality of branches, with the
    majority performed by the front end. I.e., there is no branching in
    the OoO engine that sees the uOps, at the most it confirms the branch prediction, or diagnoses a misprediction, at which point the OoO
    engine is out of a job and has to wait for the front end; possibly
    only the ROB (which deals with instructions again) resolves the
    misprediction and kicks the front end into action, however.

    And a uOp triggers that action sequence.
    I don't see the distinction you are trying to make.

    As Mitch Alsup has written, AMD has its MacroOps (load-op and RMW) in addition to the Rops. It's not entirely clear which parts of the
    engine see MacroOps and ROPs, but my impression was that the MacroOps
    are not split into ROPs for the largest part of the OoO engine.

AMD explains their terminology here, but note that the relationship
    between Macro-Ops and Micro-Ops is micro-architecture specific.

A Seventh-Generation x86 Microprocessor, 1999
https://www.academia.edu/download/70925991/4.79985120211001-19357-4pufup.pdf

    "An [micro-]OP is the minimum executable entity understood by the machine."
    A macro-op is a bundle of 1 to 3 micro-ops.
    Simple instructions map to 1 macro and 1-3 micro ops
    and this mapping is done in the decoder.
    Complex instructions map to one or more "micro-lines" each of which
    consists of 3 macro-ops (of 1-3 micro-ops each) pulled from micro-code ROM.
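
As a toy model of that terminology (the encoding below is invented for
illustration; it is not AMD's actual internal format): a load-op-store
instruction decodes to one macro-op bundling three micro-ops.

#include <stdio.h>

/* Toy model of the macro-op/micro-op bundling quoted above. */
enum uop_kind { UOP_LOAD, UOP_ALU, UOP_STORE };

struct micro_op { enum uop_kind kind; };

struct macro_op {
    int n;                    /* 1 to 3 micro-ops per macro-op */
    struct micro_op ops[3];
};

int main(void)
{
    /* e.g. an x86 "add [mem], reg" style load-op-store as one macro-op */
    struct macro_op m = { 3, { { UOP_LOAD }, { UOP_ALU }, { UOP_STORE } } };

    printf("one macro-op carrying %d micro-ops\n", m.n);
    return 0;
}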

    See "The Anatomy of a High-Performance Microprocessor: A Systems
    Perspective" by Bruce Shriver and Bennett Smith.

    From 1998. Unfortunately, there are not many more recent books about
    the microarchitecture of OoO CPUs. What I have found:

    Modern Processor Design: Fundamentals of Superscalar Processors
    John Paul Shen, Mikko H. Lipasti
    McGraw-Hill
    656 pages
published 2004 or so (don't let the 2013 date from the reprint fool you)
Discusses CPU design (not just OoO) using various real CPUs from the
1990s as example.

    Processor Microarchitecture -- An Implementation Perspective
    Antonio Gonzalez , Fernando Latorre , Grigorios Magklis
    Springer
    published 2010
    Relatively short, discusses the various parts of an OoO CPU and how to implement them.

    Henry Wong
    A Superscalar Out-of-Order x86 Soft Processor for FPGA
Ph.D. thesis, U. Toronto
https://www.stuffedcow.net/files/henry-thesis-phd.pdf
Slides: https://web.stanford.edu/class/ee380/Abstracts/190605-slides.pdf
published 2017

    A problem is that the older books don't cover recent developments such
    as alias prediction and that Wong was limited by what a single person
    can do (his work was not part of a larger research project at
    U. Toronto), as well as what fits into an FPGA.

    BTW, Wong's work can be seen as a refutation of BGB's statement: He
    chose to implement IA-32; on slide 14 of <https://web.stanford.edu/class/ee380/Abstracts/190605-slides.pdf> he
    states "It’s easy to implement!".

    - anton

    Other micro-architecture related sources since 2000:

    Book
    A Primer on Memory Consistency and Cache Coherence 2nd Ed, 2020
    Vijay Nagarajan, Daniel J. Sorin, Mark D. Hill, and David A. Wood

    Dissertation
    Complexity and Correctness of a Super-Pipelined Processor, 2005
Jochen Preiß

    Book
    General-Purpose Graphics Processor Architectures, 2018
    Aamodt, Wai Lun Fung, Rogers

    Book
    Microprocessor Architecture
    From Simple Pipelines to Chip Multiprocessors, 2010
    Jean-Loup Baer

    Book
    Processor Microarchitecture An Implementation Perspective, 2011
Antonio González, Fernando Latorre, and Grigorios Magklis

    This is a bit introductory level:

    Book
    Computer Organization and Design
    The Hardware/Software Interface: RISC-V Edition, 2018
    Patterson, Hennessy


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@[email protected] to comp.arch on Thu Sep 18 20:26:29 2025
    From Newsgroup: comp.arch

    On Thu, 18 Sep 2025 12:33:44 -0400
    EricP <[email protected]> wrote:

    Anton Ertl wrote:
    Thomas Koenig <[email protected]> writes:
    BGB <[email protected]> schrieb:

    Still sometimes it seems like it is only a matter of time until
    Intel or AMD releases a new CPU that just sort of jettisons x86
    entirely at the hardware level, but then pretends to still be an
    x86 chip by running *everything* in a firmware level emulator via
    dynamic translation.
    For AMD, that has happend already a few decades ago; they translate
    x86 code into RISC-like microops.

    That's nonsense; regulars of this groups should know better, at
    least this nonsense has been corrected often enough. E.g., I wrote
    in <[email protected]>:

|Not even if the microcode the Intel and AMD chips used was really
|RISC-like, which it was not (IIRC the P6 uses micro-instructions with
|around 100bits, and the K7 has a read-write Rop (with the "R" of "Rop"
|standing for "RISC").

    I don't know what you are objecting to - Intel calls its internal instructions micro-operations or uOps, and AMD calls its Rops.


No, they don't. They stopped using the term Rops almost 25 years ago.
If they used it in early K7 manuals then it was partly due to inertia (K6
manuals copy&pasted without much thought given) and partly because
of marketing, because RISC was considered cool.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@[email protected] to comp.arch on Thu Sep 18 14:42:36 2025
    From Newsgroup: comp.arch

    Michael S wrote:
    On Thu, 18 Sep 2025 12:33:44 -0400
    EricP <[email protected]> wrote:

    Anton Ertl wrote:
    Thomas Koenig <[email protected]> writes:
    BGB <[email protected]> schrieb:

    Still sometimes it seems like it is only a matter of time until
    Intel or AMD releases a new CPU that just sort of jettisons x86
    entirely at the hardware level, but then pretends to still be an
    x86 chip by running *everything* in a firmware level emulator via
    dynamic translation.
    For AMD, that has happend already a few decades ago; they translate
    x86 code into RISC-like microops.
    That's nonsense; regulars of this groups should know better, at
    least this nonsense has been corrected often enough. E.g., I wrote
    in <[email protected]>:

    |Not even if the microcode the Intel and AMD chips used was really
|RISC-like, which it was not (IIRC the P6 uses micro-instructions with
|around 100bits, and the K7 has a read-write Rop (with the "R" of "Rop"
|standing for "RISC").
    I don't know what you are objecting to - Intel calls its internal
    instructions micro-operations or uOps, and AMD calls its Rops.


    No, they don't. They stopped using term Rops almost 25 years ago.
    If they used it in early K7 manuals then it was due to inertia (K6
    manuals copy&pasted without much of thought given) and partly because
    of marketing, because RISC was considered cool.

    And the fact that all the RISC processors ran rings around the CISC ones.
    So they wanted to promote that "hey, we can go fast too!"

    Ok, AMD dropped the "risc" prefix 25 years ago.
    That didn't change the way it works internally.

    They still use the term "micro op" in the Intel and AMD Optimization guides.
It still means a micro-architecture-specific internal simple, discrete
    unit of execution, albeit a more complex one as transistor budgets allow.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@[email protected] to comp.arch on Thu Sep 18 14:05:04 2025
    From Newsgroup: comp.arch

    On 9/18/2025 11:16 AM, MitchAlsup wrote:

    Thomas Koenig <[email protected]> posted:

    BGB <[email protected]> schrieb:

Still sometimes it seems like it is only a matter of time until Intel or AMD releases a new CPU that just sort of jettisons x86 entirely at the
    hardware level, but then pretends to still be an x86 chip by running
    *everything* in a firmware level emulator via dynamic translation.

    For AMD, that has happend already a few decades ago; they translate
    x86 code into RISC-like microops.

    With a very loose definition of RISC::

    a)Does a RISC ISA contain memory reference address generation from
    the pattern [Rbase+Rindex<<scale+Displacement] ??
    Some will argue yes, others no.

    b) does a RISC ISA contain memory reference instructions that are
    combined with arithmetic calculations ??
    Some will argue yes, others no.

    c) does a RISC ISA contain memory reference instructions that
    access memory twice ??? LD-OP-ST :: but the TLB only once ?!?
    Most would argue no.

    Yet, this is the µISA of K7 and K8. It is only RISC in the very
    loosest sense of the word.

    And do not get me started on the trap/exception/interrupt model.



    Still reminds me of the LOL of some of the old marketing for the TI
    MSP430 trying to pass it off as RISC:
    In practice has variable-length instructions (via @PC+ addressing);
    Has auto-increment addressing modes and similar;
    Most instructions can operate directly on memory;
    Has ability to do Mem/Mem operations;
    ...

In effect, the MSP430 was closer to the DEC PDP-11 than to much of anything else in the RISC family.

    Even SuperH, which also branched off from similar origins, had gone over
    to purely 16-bit instructions, and was Load/Store, so more deserving of
    the RISC title (though apparently still a lot more PDP-11 flavored than
    MIPS flavored).


Their rationale: "But our instruction listing isn't very long, so RISC",
never mind all of the edge cases they hid away in the various addressing
modes and register combinations.

    But, yeah, following similar logic to what TI was using, one could look
    at something like the Motorola 68000 and be all like, "Yep, looks like
    RISC to me"...


    ...




    See "The Anatomy of a High-Performance Microprocessor: A Systems
    Perspective" by Bruce Shriver and Bennett Smith.

    For a later perspective, see

    https://github.com/google/security-research/blob/master/pocs/cpus/entrysign/zentool/docs/reference.md



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@[email protected] to comp.arch on Thu Sep 18 22:56:22 2025
    From Newsgroup: comp.arch

    On Thu, 18 Sep 2025 14:42:36 -0400
    EricP <[email protected]> wrote:

    Michael S wrote:
    On Thu, 18 Sep 2025 12:33:44 -0400
    EricP <[email protected]> wrote:

    Anton Ertl wrote:
    Thomas Koenig <[email protected]> writes:
    BGB <[email protected]> schrieb:

    Still sometimes it seems like it is only a matter of time until
    Intel or AMD releases a new CPU that just sort of jettisons x86
    entirely at the hardware level, but then pretends to still be an
    x86 chip by running *everything* in a firmware level emulator
    via dynamic translation.
    For AMD, that has happend already a few decades ago; they
    translate x86 code into RISC-like microops.
    That's nonsense; regulars of this groups should know better, at
    least this nonsense has been corrected often enough. E.g., I
    wrote in <[email protected]>:

    |Not even if the microcode the Intel and AMD chips used was really
|RISC-like, which it was not (IIRC the P6 uses micro-instructions with
|around 100bits, and the K7 has a read-write Rop (with the "R" of "Rop"
|standing for "RISC").
    I don't know what you are objecting to - Intel calls its internal
    instructions micro-operations or uOps, and AMD calls its Rops.


    No, they don't. They stopped using term Rops almost 25 years ago.
    If they used it in early K7 manuals then it was due to inertia (K6
    manuals copy&pasted without much of thought given) and partly
    because of marketing, because RISC was considered cool.

    And the fact that all the RISC processors ran rings around the CISC
    ones.

    In 1988. In 1998 - much less so.

    So they wanted to promote that "hey, we can go fast too!"

    Ok, AMD dropped the "risc" prefix 25 years ago.
    That didn't change the way it works internally.


Of course they did. Several times.
Even Zen3 works non-trivially differently from Zen1 and 2.
If you stopped following in the previous millennium, it's your problem
rather than theirs.

    They still use the term "micro op" in the Intel and AMD Optimization
    guides. It still means a microarchitecture-specific, simple, discrete
    internal unit of execution, albeit a more complex one as
    transistor budgets allow.


    By that logic every CISC is RISC, because at some internal level they
    execute simple operations. Even those with a load-ALU pipeline do the load
    and the ALU operation at separate stages.







    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@[email protected] (Anton Ertl) to comp.arch on Fri Sep 19 09:50:32 2025
    From Newsgroup: comp.arch

    BGB <[email protected]> writes:
    On 9/17/2025 4:33 PM, John Levine wrote:
    According to BGB <[email protected]>:
    Still sometimes it seems like it is only a matter of time until Intel or AMD releases a new CPU that just sort of jettisons x86 entirely at the
    hardware level, but then pretends to still be an x86 chip by running
    *everything* in a firmware level emulator via dynamic translation.

    That sounds a whole lot like what Transmeta did 25 years ago:

    https://en.wikipedia.org/wiki/Transmeta_Crusoe

    They failed but perhaps things are different now. Their
    native architecture was VLIW which might have been part
    of the problem.


    Might be different now:
    25 years ago, Moore's law was still going strong, and the general
    concern was more about maximizing scalar performance rather than energy efficiency or core count (and, in those days, processors were generally single-core).

    IA-64 CPUs were shipped until July 29, 2021, and Poulson (released
    2012) has 8 cores. If IA-64 (and dynamically translating AMD64 to it)
    would be a good idea nowadays, it would not have been canceled.

    How should the number of cores change anything? If you cannot make single-threaded IA-32 or AMD64 programs run at competitive speeds on
    IA-64 hardware, how would that inefficiency be eliminated in
    multi-threaded programs?

    Now we have a different situation:
    Moore's law is dying off;

    Even if that is the case, how should that change anything about the
    relative merits of the two approaches?

    Scalar CPU performance has hit a plateau;

    True, but again, what's the relevance for the discussion at hand?

    And, for many uses, performance is "good enough";

    In that case, better buy a cheaper AMD64 CPU rather than a
    particularly fast CPU with a different architecture X and then run a
    dynamic AMD64->X translator on it.

    A lot more software can make use of multi-threading;

    Possible, but how would it change things?

    Likewise, x86 tends to need a lot of the "big CPU" stuff to perform
    well, whereas something like a RISC style ISA can get better performance
    on a comparably smaller and cheaper core, and with a somewhat better "performance per watt" metric.

    Evidence?

    Yes, you can run CPUs with Intel P-cores and AMD's non-compact cores
    with higher power limits than what the Apple and Qualcomm chips
    approximately consume (I have not seen proper power consumption
    numbers for these since Anandtech stopped publishing), but you can
    also run Intel CPUs and AMD CPUs at low power limits, with much better "performance per watt". It's just that many buyers of these CPUs care
    about performance, not performance per watt.

    And if you run AMD64 software on your binary translator on CPUs with
    e.g., ARM A64 architecture, the performance per watt is worse than
    when running it on an AMD64 CPU.

    So, one possibility could be, rather than a small number of big/fast
    cores (either VLIW or OoO), possibly a larger number of smaller cores.

    The cores could maybe be LIW or in-order RISC.

    The approach of a large number of small, slow cores has been tried,
    e.g., in the TILE64, but has not been successful with that core size.
    Other examples are Sun's UltraSparc T1000 and followons, which were
    somewhat more successful, but eventually led to the cancellation of
    SPARC.

    Finally, Intel now offers E-core-only chips for clients (e.g., N100)
    and servers (Sierra Forest), but they have not stopped releasing
    P-Core-only server CPUs. For the desktop the CPU with the largest
    number of E-Cores (16) also has 8 P-cores, so Intel obviously
    believes that not all desktop applications are embarrassingly
    parallel.

    Intel used to have Xeon Phi CPUs with a higher number of narrower
    cores, but eventually replaced them with Xeon processors that have
    fewer, but more powerful cores.

    AMD offers compact-core-only server CPUs with more cores and less
    cache per core, but otherwise the same microarchitecture, only with a
    much lower clock ceiling. (There is a difference in microarchitecture
    wrt executing AVX-512 instructions on Zen5, but that's minor). AMD
    also offers server CPUs with non-compact cores; interestingly, if we
    compare CPUs with the same numbers of cores, the launch price (at the
    same date) is not that far apart:

    Model      cores  base   boost  cache  TDP   launch    current
                      (GHz)  (GHz)
    EPYC 9755  128    2.7    4.1    512MB  500W  USD12984  EUR5979
    EPYC 9745  128    2.3    3.7    256MB  400W  USD12141  EUR4192

    Current pricing from <https://geizhals.eu/?cat=cpuamdam4&xf=12099_Server~25_128~596_Turin~596_Turin+Dense>;
    however, the third-cheapest dealer for the 9745 asks for EUR 6129, and
    the cheapest price up to 2025-09-10 has been EUR 6149, so the current
    price difference may be short-lived. The cheapest price for the 9755
    was EUR 4461 on 2025-08-25, and at that time the 9755 was cheaper than the
    9745 (at least as far as the prices seen by the website above are
    concerned).

    I have thought about why the idea of more smaller cores has not been
    more successful, at least for the kinds of loads where you have a
    large number of independent and individually not particularly
    demanding threads, as in web shops. My explanation is that you need
    1) memory bandwidth and 2) interconnection with the rest of the
    system.

    The interconnection with the rest of the system probably does
    not get much cheaper for the smaller cores, and probably becomes more
    expensive with more cores (e.g., Intel switched from a ring to a grid
    when they increased the cores in their server chips).

    The bandwidth requirements to main memory for given cache sizes per
    core reduce linearly with the performance of the cores; if the larger
    number of smaller cores really leads to increased aggregate
    performance, additional main memory bandwidth is needed, or you can
    compensate for that with larger caches.
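
    To put made-up numbers on that: if a big core delivering performance 1.0
    generates 10 GB/s of DRAM traffic with its caches, a half-performance core
    with the same cache sizes generates roughly 5 GB/s, but you need more of
    them for the same or higher aggregate performance:

    # Toy model, invented numbers: per-core DRAM traffic scales with per-core
    # performance when the cache size per core is held constant.
    big_perf, big_bw = 1.0, 10.0         # performance units, GB/s per core
    small_perf, small_bw = 0.5, 5.0      # half the performance, half the traffic

    n_big, n_small = 32, 80              # 80 small cores to beat 32 big cores
    print(n_big * big_perf, n_big * big_bw)          # 32.0 perf, 320 GB/s
    print(n_small * small_perf, n_small * small_bw)  # 40.0 perf, 400 GB/s
    # Higher aggregate performance from the small cores also means higher
    # aggregate DRAM bandwidth demand, unless the caches per core grow.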

    But to eliminate some variables, let's just consider the case where we
    want to get the same performance with the same main memory bandwidth
    from using more smaller cores than we use now. Will the resulting CPU
    require less area? The cache sizes per core are not reduced, and
    their area is not reduced much. The core itself will get smaller, and
    its performance will also get smaller (although by less than the
    core). But if you sum up the total per-core area (core, caches, and interconnect), at some point the per-core area reduces by less than
    the per-core performance, so for a given amount of total performance,
    the area goes up.
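
    A toy area model (all numbers invented) illustrates where that bites: the
    cache and interconnect area per core stay roughly fixed, so shrinking the
    core eventually stops paying off in area per unit of performance:

    fixed = 4.0                          # per-core caches + interconnect (area units)
    designs = [                          # (name, core area, core performance), invented
        ("big",    8.0, 1.00),
        ("medium", 4.0, 0.70),
        ("small",  2.0, 0.45),
        ("tiny",   1.0, 0.25),
    ]
    for name, core_area, perf in designs:
        print(name, (core_area + fixed) / perf)   # area per unit of performance
    # big 12.0, medium ~11.4, small ~13.3, tiny 20.0: once the core is small
    # relative to its fixed surroundings, total area for a given total
    # performance starts to rise again.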

    There is one counterargument to these considerations: The largest
    configuration of Turin dense has less cache for more cores than the
    largest configuration of Turin. I expect that's the reason why they
    offer both; if you have less memory-intensive loads, Turin dense with
    the additional cores will give you more performance, otherwise you
    better buy Turin.

    Also, Intel has added 16 E-Cores to their desktop chips without giving
    them the same amount of caches as the P-Cores; e.g., in Arrow lake we
    have

    P-core:  48KB D-L0, 64KB I-L1, 192KB D-L1, 3MB L2,        3MB L3/core
    E-core:             64KB I-L1,  32KB D-L1, 4MB L2/4cores, 3MB L3/4cores
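
    For scale, the cache per core in that table works out roughly as follows
    (treating the shared E-core L2/L3 as split evenly across a 4-core cluster):

    KB, MB = 1, 1024
    p_core = 48*KB + 64*KB + 192*KB + 3*MB + 3*MB     # ~6448 KB (~6.3 MB) per P-core
    e_core = 32*KB + 64*KB + 4*MB/4 + 3*MB/4          # ~1888 KB (~1.8 MB) per E-core
    print(p_core / MB, e_core / MB, p_core / e_core)  # ~6.3, ~1.8, ~3.4x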

    Here we don't have an alternative with more P-Cores and the same
    bandwidth, so we cannot contrast the approaches. But it's certainly
    the case that if you have a bandwidth-hungry load, you don't need to
    buy the Arrow Lake with the largest number of E-Cores.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@[email protected] (Anton Ertl) to comp.arch on Fri Sep 19 14:33:44 2025
    From Newsgroup: comp.arch

    BGB <[email protected]> writes:
    Like, most of the ARM chips don't exactly have like a 150W TDP or similar...

    And most Intel and AMD chips don't have a 150W TDP either, although the
    shenanigans they play with TDP are not nice. The usual TDP for
    Desktop chips is 65W (with the power limits temporarily or permanently
    higher). The Zen5 laptop chips (Strix Point, Krackan Point) have a configurable TDP of 15-54W. Lunar Lake (4 P-cores, 4 LP-E-cores) has
    a configurable TDP of 8-37W.

    Like, if an ARM chip uses 1/30th the power, unless it is more than 30x slower, it may still win in Perf/W and similar...

    No TDP numbers are given for Oryon. For Apple's M4, the numbers are

    M4 4P 6E 22W
    M4 Pro 8P 4E 38W
    M4 Pro 10P 4E 46W
    M4 Max 10P 4E 62W
    M4 Max 12P 4E 70W

    Not quite 1/30th of the power, although I think that Apple does not
    play the same shenanigans as Intel and AMD.

    [RISC-V]
    recent proposals for indexed load/store and auto-increment popping up,

    Where can I read about that?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@[email protected] to comp.arch on Fri Sep 19 18:12:38 2025
    From Newsgroup: comp.arch

    On Fri, 19 Sep 2025 09:50:32 GMT
    [email protected] (Anton Ertl) wrote:


    I have thought about why the idea of more smaller cores has not been
    more successful, at least for the kinds of loads where you have a
    large number of independent and individually not particularly
    demanding threads, as in web shops. My explanation is that you need
    1) memory bandwidth and 2) interconnection with the rest of the
    system.

    The interconnection with the rest of the system probably does
    not get much cheaper for the smaller cores, and probably becomes more expensive with more cores (e.g., Intel switched from a ring to a grid
    when they increased the cores in their server chips).


    That particular problem is addressed by grouping smaller cores into
    clusters with a shared L2 cache. It's especially effective for scaling
    when the L2 cache is truly inclusive relative to the underlying L1 caches.
    The price is limited L2 bandwidth as seen by the cores.

    BTW, I didn't find any info about the replacement policy of Intel's
    Sierra Forest L2 caches.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@[email protected] (Anton Ertl) to comp.arch on Fri Sep 19 15:05:56 2025
    From Newsgroup: comp.arch

    EricP <[email protected]> writes:
    Anton Ertl wrote:
    Thomas Koenig <[email protected]> writes:
    BGB <[email protected]> schrieb:

    Still sometimes it seems like it is only a matter of time until Intel or AMD releases a new CPU that just sort of jettisons x86 entirely at the hardware level, but then pretends to still be an x86 chip by running
    *everything* in a firmware level emulator via dynamic translation.
    For AMD, that has happened already a few decades ago; they translate
    x86 code into RISC-like microops.

    That's nonsense; regulars of this group should know better, at least
    this nonsense has been corrected often enough. E.g., I wrote in
    <[email protected]>:

    |Not even if the microcode the Intel and AMD chips used was really
    |RISC-like, which it was not (IIRC the P6 uses micro-instructions with
    |around 100bits, and the K7 has a read-write Rop (with the "R" of "Rop"
    |standing for "RISC").

    I don't know what you are objecting to

    I am objecting to the claim that uops are RISC-like, and that there is
    a translation to RISC occuring inside the CPU, and (not present here,
    but often also claimed) that therefore there is no longer a difference
    between RISC and non-RISC.

    One can discuss the details, but at the end of the day, uops are some implementation-specific internals of the microarchitecture, whereas a
    RISC architecture is an architecture.

    The number of bits has nothing to do with what it is called.
    If this uOp was for a ROB style design where all the knowledge about
    each instruction including register ids, immediate data,
    scheduling info, result data, status, is stored in a single ROB entry,
    then 100 bits sounds pretty small, so I'm guessing that was a 32-bit CPU.

    Yes, P6 is the code name for the Pentium Pro, which has a ROB, and,
    more importantly, valued reservation stations, and yes, the 118 or
    whatever bits include the operands. I have no idea how the P6 handles
    its 80-bit FP with valued RSs; maybe it has bigger uops in its FP part
    (but I think it has a unified scheduler, so that would not work out,
    or maybe I'm missing something).

    But concerning the discussion at hand: Containing the data is a
    significant deviation from RISC instruction sets, and RISC
    instructions are typically only 32 bits or 16 bits wide.

    Another difference is that the OoO engine that sees the uOps performs
    only a very small part of the functionality of branches, with the
    majority performed by the front end. I.e., there is no branching in
    the OoO engine that sees the uOps; at most it confirms the branch
    prediction, or diagnoses a misprediction, at which point the OoO
    engine is out of a job and has to wait for the front end; possibly
    only the ROB (which deals with instructions again) resolves the
    misprediction and kicks the front end into action, however.

    And a uOp triggers that action sequence.
    I don't see the distinction you are trying to make.

    The major point is that the OoO engine (the part that deals with uops)
    sees a linear sequence of uops it has to process, with nearly all
    actual branch processing (which an architecture has to do) done in a
    part that does not deal with uops. With the advent of uop caches that
    has changed a bit, but many of the CPUs for which the uop=RISC claim
    has been made do not have a uop cache.

    It's not entirely clear which parts of the
    engine see MacroOps and ROPs, but my impression was that the MacroOps
    are not split into ROPs for the largest part of the OoO engine.

    AMD explains their terminology here, but note that the relationship
    between Macro-Ops and Micro-Ops is microarchitecture-specific.

    A Seventh-Generation x86 Microprocessor, 1999
    https://www.academia.edu/download/70925991/4.79985120211001-19357-4pufup.pdf

    "An [micro-]OP is the minimum executable entity understood by the machine."
    A macro-op is a bundle of 1 to 3 micro-ops.
    Simple instructions map to 1 macro and 1-3 micro ops
    and this mapping is done in the decoder.
    Complex instructions map to one or more "micro-lines" each of which
    consists of 3 macro-ops (of 1-3 micro-ops each) pulled from micro-code ROM.

    Yes, so much is clear. It's not clear where Macro-Ops are in play and
    where Micro-Ops are in play. Over time I get the impression that the
    macro-ops are the main thing running through the OoO engine, and
    Micro-Ops are only used in specific places, but it's completely
    unclear to me where. E.g., if they let an RMW Macro-Op run through
    the OoO engine, it would first go to the LSU for the address
    generation, translation and load, then to the ALU for the
    modification, then to the LSU for the store, and then to the ROB.
    Where in this whole process is a Micro-Op actually stored?
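
    For concreteness, the bundling relationship the paper describes might be
    sketched like this (terminology only; nothing here is AMD's actual format,
    and the RMW mapping is a hypothetical example):

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class MicroOp:            # "minimum executable entity": a load, ALU op, store, ...
        kind: str

    @dataclass
    class MacroOp:            # a bundle of 1 to 3 micro-ops, tracked as one unit
        micro_ops: List[MicroOp]

    # Simple instruction: the decoder emits one macro-op directly.
    # Hypothetical: ADD [mem], reg -> load + alu + store in a single macro-op.
    add_mem = MacroOp([MicroOp("load"), MicroOp("alu"), MicroOp("store")])

    # Microcoded complex instruction: micro-lines of 3 macro-ops come from ROM.
    micro_line = [MacroOp([MicroOp("load")]),
                  MacroOp([MicroOp("alu")]),
                  MacroOp([MicroOp("store")])]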

    This is a bit introductory level:

    Book
    Computer Organization and Design
    The Hardware/Software Interface: RISC-V Edition, 2018
    Patterson, Hennessy

    Their "Computer Architecture" book is also revised every few years,
    but their treatment of OoO makes me think that they are not at all
    interested in that part anymore, instead more in, e.g., multiprocessor
    memory subsystems.

    And the fact that we see so few recent books on the topics makes me
    think that many in academia have decided that this is a topic that
    they leave to industry.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Fri Sep 19 16:14:53 2025
    From Newsgroup: comp.arch


    [email protected] (Anton Ertl) posted:

    EricP <[email protected]> writes:
    Anton Ertl wrote:
    Thomas Koenig <[email protected]> writes:
    BGB <[email protected]> schrieb:
    -------------------------------

    Yes, so much is clear. It's not clear where Macro-Ops are in play and
    where Micro-Ops are in play. Over time I get the impression that the macro-ops are the main thing running through the OoO engine, and
    Micro-Ops are only used in specific places, but it's completely
    unclear to me where. E.g., if they let an RMW Macro-Op run through
    the OoO engine, it would first go to the LSU for the address
    generation, translation and load, then to the ALU for the
    modification, then to the LSU for the store, and then to the ROB.
    Where in this whole process is a Micro-Op actually stored?

    In the reservation station.


    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Fri Sep 19 16:23:06 2025
    From Newsgroup: comp.arch


    [email protected] (Anton Ertl) posted:

    BGB <[email protected]> writes:
    On 9/17/2025 4:33 PM, John Levine wrote:
    According to BGB <[email protected]>:
    --------------------------------------

    I have thought about why the idea of more smaller cores has not been
    more successful, at least for the kinds of loads where you have a
    large number of independent and individually not particularly
    demanding threads, as in web shops. My explanation is that you need
    1) memory bandwidth and 2) interconnection with the rest of the
    system.

    Yes, exactly:: if you have a large number of cores doing a performance of
    X, they will need exactly the same memory BW as a smaller number of cores
    also performing at X.

    In addition, the interconnect has to be at least as good as the small core system.

    The interconnection with the rest of the system probably does
    not get much cheaper for the smaller cores, and probably becomes more expensive with more cores (e.g., Intel switched from a ring to a grid
    when they increased the cores in their server chips).

    The bandwidth requirements to main memory for given cache sizes per
    core reduce linearly with the performance of the cores; if the larger
    number of smaller cores really leads to increased aggregate
    performance, additional main memory bandwidth is needed, or you can compensate for that with larger caches.

    Sooner or later, you actually have to read/write main memory.

    But to eliminate some variables, let's just consider the case where we
    want to get the same performance with the same main memory bandwidth
    from using more smaller cores than we use now. Will the resulting CPU require less area? The cache sizes per core are not reduced, and
    their area is not reduced much.

    A core running at ½ the performance can use a cache that is ¼ the size
    and see the same percentage degradation WRT cache misses (as long as
    main memory is equally latent). TLBs too.
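
    That is consistent with the usual square-root rule of thumb, where miss
    rate scales roughly as 1/sqrt(cache size); a quick check with invented
    baseline numbers:

    # Big core: 1 ns per instruction, 2% miss rate, 80 ns to main memory.
    stall_per_work = 0.02 * 80.0 / 1.0       # memory stall time per unit of compute time
    # Half-performance core: 2 ns per instruction, same memory latency.
    # A quarter-size cache doubles the miss rate (1/sqrt(1/4) = 2) to 4%.
    stall_per_work_half = 0.04 * 80.0 / 2.0
    print(stall_per_work, stall_per_work_half)   # 1.6 and 1.6: same relative degradation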

    The core itself will get smaller, and

    12× smaller and 12× lower power

    its performance will also get smaller (although by less than the
    core).

    for ½ the performance

    But if you sum up the total per-core area (core, caches, and interconnect), at some point the per-core area reduces by less than
    the per-core performance, so for a given amount of total performance,
    the area goes up.

    GBOoO Cores tend to be about the size of 512KB of L2

    There is one counterargument to these considerations: The largest configuration of Turin dense has less cache for more cores than the
    largest configuration of Turin. I expect that's the reason why they
    offer both; if you have less memory-intensive loads, Turin dense with
    the additional cores will give you more performance, otherwise you
    better buy Turin.

    Also, Intel has added 16 E-Cores to their desktop chips without giving
    them the same amount of caches as the P-Cores; e.g., in Arrow lake we
    have

    P-core 48KB D-L0 64KB I-L1 192KB D-L1 3MB L2 3MB L3/core
    E-Core 32KB D-L1 64KB I-L1 4MB L2/4 cores 3MB L3/4cores

    Here we don't have an alternative with more P-Cores and the same
    bandwidth, so we cannot contrast the approaches. But it's certainly
    the case that if you have a bandwidth-hungry load, you don't need to
    buy the Arrow Lake with the largest number of E-Cores.

    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@[email protected] to comp.arch on Fri Sep 19 11:41:19 2025
    From Newsgroup: comp.arch

    On 9/19/2025 9:33 AM, Anton Ertl wrote:
    BGB <[email protected]> writes:
    Like, most of the ARM chips don't exactly have like a 150W TDP or similar...

    And most Intel and AMD chips don't have a 150W TDP either, although the
    shenanigans they play with TDP are not nice. The usual TDP for
    Desktop chips is 65W (with the power limits temporarily or permanently higher). The Zen5 laptop chips (Strix Point, Krackan Point) have a configurable TDP of 15-54W. Lunar Lake (4 P-cores, 4 LP-E-cores) has
    a configurable TDP of 8-37W.



    Seems so...
    Seems the CPU I am running has a 105W TDP; I had thought I remembered
    150W, oh well...

    Seems 150-200W is more Threadripper territory, and not the generic
    desktop CPUs.


    Like, if an ARM chip uses 1/30th the power, unless it is more than 30x
    slower, it may still win in Perf/W and similar...

    No TDP numbers are given for Oryon. For Apple's M4, the numbers are

    M4 4P 6E 22W
    M4 Pro 8P 4E 38W
    M4 Pro 10P 4E 46W
    M4 Max 10P 4E 62W
    M4 Max 12P 4E 70W

    Not quite 1/30th of the power, although I think that Apple does not
    play the same shenanigans as Intel and AMD.



    A lot of the ARM SoCs I had seen had lower TDPs, though more often with Cortex A53 or A55/A78 cores or similar:

    Say (MediaTek MT6752):
    https://unite4buy.com/cpu/MediaTek-MT6752/
    Has a claimed TDP here of 7W and has 8x A53.

    Or, for a slightly newer chip (2020):
    https://www.cpu-monkey.com/en/cpu-mediatek_mt8188j

    TDP 5W, has A55 and A78 cores.


    Some amount of the HiSilicon numbers look similar...


    But, yeah, I guess if using these as data-points:
    A55: ~ 5/8W, or ~ 0.625W (very crude)
    Zen+: ~ 105/16W, ~ 6.56W

    So, more like 10x here, but ...


    Then, I guess it becomes a question of the relative performance
    difference, say, between a 2.0 GHz A55 vs a 3.7 GHz Zen+ core...
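
    Putting those crude numbers together, the Perf/W break-even point is just
    the per-core power ratio (same invented precision as above):

    a55_w = 5.0 / 8       # ~0.625 W per core, from the MT6752 guess above
    zen_w = 105.0 / 16    # ~6.56 W per core, SMT and uncore ignored
    power_ratio = zen_w / a55_w               # ~10.5x
    # If the Zen+ core is less than ~10.5x faster per thread, the A55 wins on
    # Perf/W; if it is more than ~10.5x faster, the Zen+ wins.
    print(power_ratio)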

    Judging based on my cellphone (with A53 cores), and previously running
    my emulator in Termux, there is a performance difference, but nowhere
    near 10x.


    Probably need to set up a RasPi with a 64-bit OS at some point and see
    how this performs... (wouldn't really be as accurate to compare x86-64
    with 32-bit ARM).


    [RISC-V]
    recent proposals for indexed load/store and auto-increment popping up,

    Where can I read about that?


    For now, just on the mailing lists, e.g.: https://lists.riscv.org/g/tech-arch-review/message/368


    - anton

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@[email protected] to comp.arch on Fri Sep 19 12:00:07 2025
    From Newsgroup: comp.arch

    On 9/19/2025 4:50 AM, Anton Ertl wrote:
    BGB <[email protected]> writes:
    On 9/17/2025 4:33 PM, John Levine wrote:
    According to BGB <[email protected]>:
    Still sometimes it seems like it is only a matter of time until Intel or AMD releases a new CPU that just sort of jettisons x86 entirely at the hardware level, but then pretends to still be an x86 chip by running
    *everything* in a firmware level emulator via dynamic translation.

    That sounds a whole lot like what Transmeta did 25 years ago:

    https://en.wikipedia.org/wiki/Transmeta_Crusoe

    They failed but perhaps things are different now. Their
    native architecture was VLIW which might have been part
    of the problem.


    Might be different now:
    25 years ago, Moore's law was still going strong, and the general
    concern was more about maximizing scalar performance rather than energy
    efficiency or core count (and, in those days, processors were generally
    single-core).

    IA-64 CPUs were shipped until July 29, 2021, and Poulson (released
    2012) has 8 cores. If IA-64 (and dynamically translating AMD64 to it)
    would be a good idea nowadays, it would not have been canceled.

    How should the number of cores change anything? If you cannot make single-threaded IA-32 or AMD64 programs run at competitive speeds on
    IA-64 hardware, how would that inefficiency be eliminated in
    multi-threaded programs?

    Now we have a different situation:
    Moore's law is dying off;

    Even if that is the case, how should that change anything about the
    relative merits of the two approaches?

    Scalar CPU performance has hit a plateau;

    True, but again, what's the relevance for the discussion at hand?

    And, for many uses, performance is "good enough";

    In that case, better buy a cheaper AMD64 CPU rather than a
    particularly fast CPU with a different architecture X and then run a
    dynamic AMD64->X translator on it.


    Possibly, it depends.

    The question is what could Intel or AMD do if the wind blew in that
    direction.

    For the end-user, the experience is likely to look similar, so they
    might not need to know/care if they are using some lower-power native
    chip, or something that is internally running on a dynamic translator to
    some likely highly specialized ISA.



    A lot more software can make use of multi-threading;

    Possible, but how would it change things?


    Multi-threaded software does not tend to depend as much on single-thread performance as single threaded software...


    Likewise, x86 tends to need a lot of the "big CPU" stuff to perform
    well, whereas something like a RISC style ISA can get better performance
    on a comparably smaller and cheaper core, and with a somewhat better
    "performance per watt" metric.

    Evidence?


    No hard numbers, but experience here:
    ASUS Eee (with an in-order Intel Atom) vs original RasPi (with 700MHz
    ARM11 cores).

    The RasPi basically runs circles around the Eee...


    Though, no good datapoints for fast x86 emulators here.
    At least DOSBox and QEMU running x86 on RasPi tend to be dead slow.



    ( no time right now, so skipping rest )

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@[email protected] to comp.arch on Fri Sep 19 12:38:51 2025
    From Newsgroup: comp.arch

    On 9/19/2025 12:00 PM, BGB wrote:
    On 9/19/2025 4:50 AM, Anton Ertl wrote:
    BGB <[email protected]> writes:
    On 9/17/2025 4:33 PM, John Levine wrote:
    According to BGB  <[email protected]>:
    Still sometimes it seems like it is only a matter of time until
    Intel or
    AMD releases a new CPU that just sort of jettisons x86 entirely at the hardware level, but then pretends to still be an x86 chip by running *everything* in a firmware level emulator via dynamic translation.

    That sounds a whole lot like what Transmeta did 25 years ago:

    https://en.wikipedia.org/wiki/Transmeta_Crusoe

    They failed but perhaps things are different now.  Their
    native architecture was VLIW which might have been part
    of the problem.


    Might be different now:
    25 years ago, Moore's law was still going strong, and the general
    concern was more about maximizing scalar performance rather than energy
    efficiency or core count (and, in those days, processors were generally
    single-core).

    IA-64 CPUs were shipped until July 29, 2021, and Poulson (released
    2012) has 8 cores.  If IA-64 (and dynamically translating AMD64 to it)
    would be a good idea nowadays, it would not have been canceled.

    How should the number of cores change anything?  If you cannot make
    single-threaded IA-32 or AMD64 programs run at competitive speeds on
    IA-64 hardware, how would that inefficiency be eliminated in
    multi-threaded programs?

    Now we have a different situation:
       Moore's law is dying off;

    Even if that is the case, how should that change anything about the
    relative merits of the two approaches?

       Scalar CPU performance has hit a plateau;

    True, but again, what's the relevance for the discussion at hand?

       And, for many uses, performance is "good enough";

    In that case, better buy a cheaper AMD64 CPU rather than a
    particularly fast CPU with a different architecture X and then run a
    dynamic AMD64->X translator on it.


    Possibly, it depends.

    The question is what could Intel or AMD do if the wind blew in that direction.

    For the end-user, the experience is likely to look similar, so they
    might not need to know/care if they are using some lower-power native
    chip, or something that is internally running on a dynamic translator to some likely highly specialized ISA.



       A lot more software can make use of multi-threading;

    Possible, but how would it change things?


    Multi-threaded software does not tend to depend as much on single-thread performance as single threaded software...


    Likewise, x86 tends to need a lot of the "big CPU" stuff to perform
    well, whereas something like a RISC style ISA can get better performance on a comparably smaller and cheaper core, and with a somewhat better
    "performance per watt" metric.

    Evidence?


    No hard numbers, but experience here:
    ASUS Eee (with an in-order Intel Atom) vs original RasPi (with 700MHz
    ARM11 cores).

    The RasPi basically runs circles around the Eee...


    Though, no good datapoints for fast x86 emulators here.
      At least DOSBox and QEMU running x86 on RasPi tend to be dead slow.



    ( no time right now, so skipping rest )


    Seems I have a little time still...

    Did find this: https://browser.geekbench.com/v4/cpu/compare/2498562?baseline=2792960

    Not an exact match; I think the Eee was running the Atom at a somewhat
    lower clock speed, and this compares a Pi3 rather than the original Pi.
    The Pi3 has 4x A53 cores.


    But, yeah, they are roughly matched on single-thread performance even when
    the Atom has a clock-speed advantage.

    Though, this seems to imply that they are more just "comparable" on the performance front, rather than Atom being significantly slower...


    Would need to try to dig-out the Eee and re-test, assuming it still
    works/etc.


    --- Synchronet 3.21a-Linux NewsLink 1.2