• Intel's Software Defined Super Cores

    From John Savard@[email protected] to comp.arch on Mon Sep 15 23:54:12 2025
    From Newsgroup: comp.arch

    When I saw a post about a new way to do OoO, I had thought it might be
    talking about this:

https://www.techradar.com/pro/is-it-a-bird-is-it-a-plane-no-its-super-core-intels-latest-patent-revives-ancient-anti-hyperthreading-cpu-technique-in-attempt-to-boost-processor-performance-but-will-it-be-enough

    Basically, Intel proposes to boost single-thread performance by splitting programs into chunks that can be performed in parallel on different cores, where the cores are intimately connected in order to make this work.

    This is a sound idea, but one may not find enough opportunities to use it.

    Although it's called "inverse hyperthreading", this technique could be combined with SMT - put the chunks into different threads on the same
    core, rather than on different cores, and then one wouldn't need to add
    extra connections between cores to make it work.

    John Savard
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Savard@[email protected] to comp.arch on Tue Sep 16 00:03:51 2025
    From Newsgroup: comp.arch

    On Mon, 15 Sep 2025 23:54:12 +0000, John Savard wrote:

    Although it's called "inverse hyperthreading", this technique could be combined with SMT - put the chunks into different threads on the same
    core, rather than on different cores, and then one wouldn't need to add
    extra connections between cores to make it work.

    On further reflection, this may be equivalent to re-inventing out-of-order execution.

    John Savard
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@[email protected] to comp.arch on Mon Sep 15 17:19:36 2025
    From Newsgroup: comp.arch

    On 9/15/2025 4:54 PM, John Savard wrote:
    When I saw a post about a new way to do OoO, I had thought it might be talking about this:

https://www.techradar.com/pro/is-it-a-bird-is-it-a-plane-no-its-super-core-intels-latest-patent-revives-ancient-anti-hyperthreading-cpu-technique-in-attempt-to-boost-processor-performance-but-will-it-be-enough

    Basically, Intel proposes to boost single-thread performance by splitting programs into chunks that can be performed in parallel on different cores, where the cores are intimately connected in order to make this work.

    This is a sound idea, but one may not find enough opportunities to use it.

    Although it's called "inverse hyperthreading", this technique could be combined with SMT - put the chunks into different threads on the same
    core, rather than on different cores, and then one wouldn't need to add
    extra connections between cores to make it work.

    Two weeks ago, I saw this in Tom's Hardware.

    https://www.tomshardware.com/pc-components/cpus/intel-patents-software-defined-supercore-mimicking-ultra-wide-execution-using-multiple-cores

    But at this point, it is just a patent. While it *might* get included
    in a future product, it seems a long way away, if ever.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@[email protected] to comp.arch on Mon Sep 15 17:56:28 2025
    From Newsgroup: comp.arch

    On 9/15/2025 4:54 PM, John Savard wrote:
    When I saw a post about a new way to do OoO, I had thought it might be talking about this:

https://www.techradar.com/pro/is-it-a-bird-is-it-a-plane-no-its-super-core-intels-latest-patent-revives-ancient-anti-hyperthreading-cpu-technique-in-attempt-to-boost-processor-performance-but-will-it-be-enough

    Basically, Intel proposes to boost single-thread performance by splitting programs into chunks that can be performed in parallel on different cores, where the cores are intimately connected in order to make this work.

    We would have to somehow tell the system that the program only uses a
    single thread, right? Not exactly sure how the sync is going to work
    with regard to multi-threaded and/or multi process programs?

    A single threaded program runs, then it calls into a function that
    creates a thread. Humm...


    This is a sound idea, but one may not find enough opportunities to use it.

    Although it's called "inverse hyperthreading", this technique could be combined with SMT - put the chunks into different threads on the same
    core, rather than on different cores, and then one wouldn't need to add
    extra connections between cores to make it work.

    Can one get something kind of akin to it by a clever use of affinity
    masks? But, those are not 100% guaranteed?
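
FWIW, a minimal sketch of the affinity-mask route (assuming Linux and
glibc's pthread_setaffinity_np, which is a GNU extension): it only
constrains where a thread may run, so it is indeed not a hard guarantee,
and it gives none of the tight inter-core coupling the patent describes.

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pin the calling thread to a single core (Linux/glibc specific).
 * The scheduler may still reschedule the thread within the mask; this
 * is a placement hint, not the intimate core coupling Intel patents. */
static int pin_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof set, &set);
}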
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stefan Monnier@[email protected] to comp.arch on Tue Sep 16 10:13:35 2025
    From Newsgroup: comp.arch

    Basically, Intel proposes to boost single-thread performance by splitting programs into chunks that can be performed in parallel on different cores, where the cores are intimately connected in order to make this work.

    Sounds like [multiscalar processors](doi:multiscalar processor)


    Stefan
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stefan Monnier@[email protected] to comp.arch on Tue Sep 16 10:15:04 2025
    From Newsgroup: comp.arch

    Sounds like [multiscalar processors](doi:multiscalar processor)
    ^^^^^^^^^^^^^^^^^^^^^
    10.1145/223982.224451

[ I guess it can be useful to actually look at what one pastes before
    pressing "send", eh? ]


    Stefan
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@[email protected] to comp.arch on Tue Sep 16 15:10:09 2025
    From Newsgroup: comp.arch

    Stefan Monnier <[email protected]> schrieb:

[ I guess it can be useful to actually look at what one pastes before
    pressing "send", eh? ]

This is sooooo 2010's. Next, you'll be claiming it makes sense to
    think before writing, and where would we be then? Not in the age
    of modern social media, that's for sure.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Tue Sep 16 15:50:38 2025
    From Newsgroup: comp.arch


    John Savard <[email protected]d> posted:

    When I saw a post about a new way to do OoO, I had thought it might be talking about this:

https://www.techradar.com/pro/is-it-a-bird-is-it-a-plane-no-its-super-core-intels-latest-patent-revives-ancient-anti-hyperthreading-cpu-technique-in-attempt-to-boost-processor-performance-but-will-it-be-enough

    Basically, Intel proposes to boost single-thread performance by splitting programs into chunks that can be performed in parallel on different cores, where the cores are intimately connected in order to make this work.

    This is a sound idea, but one may not find enough opportunities to use it.

    Although it's called "inverse hyperthreading", this technique could be combined with SMT - put the chunks into different threads on the same
    core, rather than on different cores, and then one wouldn't need to add extra connections between cores to make it work.

    Andy Glew was working on stuff like this 10-15 years ago

    John Savard
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From George Neuner@[email protected] to comp.arch on Tue Sep 16 13:01:30 2025
    From Newsgroup: comp.arch

    On Tue, 16 Sep 2025 00:03:51 -0000 (UTC), John Savard <[email protected]d> wrote:

    On Mon, 15 Sep 2025 23:54:12 +0000, John Savard wrote:

    Although it's called "inverse hyperthreading", this technique could be
    combined with SMT - put the chunks into different threads on the same
    core, rather than on different cores, and then one wouldn't need to add
    extra connections between cores to make it work.

On further reflection, this may be equivalent to re-inventing out-of-order execution.

    John Savard

    Sounds more like dynamic micro-threading.

Over the years I've seen a handful of papers about compile-time
micro-threading: that is, the compiler itself identifies separable
dependency chains in serial code and rewrites them into deliberately
threaded code to be executed simultaneously.
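
As a toy illustration of what such a compiler is after (my own sketch,
not from any of those papers): the serial loop in the comment contains
two independent dependency chains, and the rewrite rehosts one of them
onto a second thread.

#include <pthread.h>

/* Toy illustration (mine): the serial loop
 *     for (i = 0; i < n; i++) { sa += a[i]; sb += b[i]; }
 * has two independent dependency chains, so a micro-threading compiler
 * could in principle rewrite it as below. */
struct chain { const long *v; long n; long sum; };

static void *run_chain(void *p)
{
    struct chain *c = p;
    for (long i = 0; i < c->n; i++)
        c->sum += c->v[i];              /* one dependency chain */
    return NULL;
}

long sum_both(const long *a, const long *b, long n)
{
    struct chain ca = { a, n, 0 }, cb = { b, n, 0 };
    pthread_t t;

    pthread_create(&t, NULL, run_chain, &cb);   /* rehosted chain */
    run_chain(&ca);                             /* chain kept local */
    pthread_join(t, NULL);
    return ca.sum + cb.sum;
}

Whether the thread-creation and synchronization overhead ever pays off
for chains of realistic length is exactly the problem you describe.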

    It is not easy to do under the best of circumstances and I've never
    seen anything about doing it dynamically at run time.

    To make a thread worth rehosting to another core, it would need to be
    (at least) many 10s of instructions in length. To figure this out
    dynamically at run time, it seems like you'd need the decode window to
    be 1000s of instructions and a LOT of "figure-it-out" circuitry.


    MMV, but to me it doesn't seem worth the effort.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@[email protected] to comp.arch on Wed Sep 17 11:54:09 2025
    From Newsgroup: comp.arch

    MitchAlsup wrote:

    John Savard <[email protected]d> posted:

    When I saw a post about a new way to do OoO, I had thought it might be
    talking about this:

https://www.techradar.com/pro/is-it-a-bird-is-it-a-plane-no-its-super-core-intels-latest-patent-revives-ancient-anti-hyperthreading-cpu-technique-in-attempt-to-boost-processor-performance-but-will-it-be-enough

    Basically, Intel proposes to boost single-thread performance by splitting
programs into chunks that can be performed in parallel on different cores, where the cores are intimately connected in order to make this work.

This is a sound idea, but one may not find enough opportunities to use it.
    Although it's called "inverse hyperthreading", this technique could be
    combined with SMT - put the chunks into different threads on the same
    core, rather than on different cores, and then one wouldn't need to add
    extra connections between cores to make it work.

    Andy Glew was working on stuff like this 10-15 years ago

That's what immediately came to my mind as well; it looks a lot like
trying some of his ideas about scouting micro-threads, doing work in the
hope that it will turn out to be useful.

    To me it sounds like it is related to eager execution, except skipping
    further forward into upcoming code.

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@[email protected] to comp.arch on Wed Sep 17 14:34:09 2025
    From Newsgroup: comp.arch

    On Wed, 17 Sep 2025 11:54:09 +0200
    Terje Mathisen <[email protected]> wrote:

    MitchAlsup wrote:

    John Savard <[email protected]d> posted:

    When I saw a post about a new way to do OoO, I had thought it
    might be talking about this:

https://www.techradar.com/pro/is-it-a-bird-is-it-a-plane-no-its-super-core-intels-latest-patent-revives-ancient-anti-hyperthreading-cpu-technique-in-attempt-to-boost-processor-performance-but-will-it-be-enough

    Basically, Intel proposes to boost single-thread performance by
    splitting programs into chunks that can be performed in parallel
    on different cores, where the cores are intimately connected in
    order to make this work.

    This is a sound idea, but one may not find enough opportunities to
    use it.

    Although it's called "inverse hyperthreading", this technique
    could be combined with SMT - put the chunks into different threads
    on the same core, rather than on different cores, and then one
    wouldn't need to add extra connections between cores to make it
    work.

    Andy Glew was working on stuff like this 10-15 years ago

    That's what immediately fell to my mind as well, it looks a lot like
    trying some of his ideas about scouting micro-threads, doing work in
    the hope that it will turn out useful.

    To me it sounds like it is related to eager execution, except
    skipping further forward into upcoming code.

    Terje



The question is what the fact of patenting most likely means.
IMHO, it means that they explored the idea and decided against going in
this particular direction in the near- to medium-term future.

I think that when Intel actually plans to use a particular idea, they
keep the idea secret for as long as they can and either don't patent it
at all or apply for a patent after the release of the product.
I could be wrong about that.

On the other hand,
some of the people named on the patent appear to be leading figures
in Intel's P-core teams. Some of them gave presentations a year ago
about the advantages of removing SMT. Removing SMT and this super-core
idea can be considered complementary - both push in the direction of
cores with a smaller number of EU pipes. So maybe the idea was seriously
considered for Intel products in the mid-term future.
Anyway, a couple of months ago Tan himself said that Intel is reversing
the decision to remove SMT, which probably means that all their mid-term
plans are undergoing significant changes.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@[email protected] (Anton Ertl) to comp.arch on Wed Sep 17 13:46:33 2025
    From Newsgroup: comp.arch

    Michael S <[email protected]> writes:
    The question is what is most likely meaning of the fact of patenting?
    IMHO, it means that they explored the idea and decided against going in
    this particular direction in the near and medium-term future.

    I think that when Intel actually plans to use particular idea then they
    keep the idea secret for as long as they can and either don't patent at
    all or apply for patent after release of the product.
    I can be wrong about it.

    That would risk that somebody without patent exchange agreements with
    Intel patents the invention first (whether independently developed or
    due to a leak). Advantages of such a strategy: Companies with patent
    exchange agreements learn even later about the invention, and the
    patent expires at a later date.

I remember an article about alias prediction (IIRC for executing
stores before architecturally earlier loads), where the author read a
patent from Intel, did some measurements on a released Intel CPU,
and confirmed that they actually implemented what the patent
described.

If you find that article and compare the date when the patent was
submitted with the date of the release of the processor, you can check
your theory.

    Some of them 1 year ago gave representations
    about advantages of removal of SMT.

    I did not read any accounts of that that appeared particularly
    knowledgeable. What are the advantages, or where can I read about
    these presentations?

    Removal of SMT and this super-core
    idea can be considered complimentary - both push into direction of
    cores with smaller # of EU pipes.

    What do you mean by that? Narrower cores? In recent years cores seem
    to have exploded in width. From 1995 up to and including 2018 Intel
    produced 3-wide and 4-wide designs (with 4-wide coming IIRC with Sandy
    Bridge in 2011), and since then even the Skymont E-core has grown to
    8-wide, with 26 execution ports and 16-wide retirement. And other CPU manufacturers have also increased the widths of their CPUs.

    It seems that there has been a breakthrough in extracting ILP, making
    wider cores pay off better, a breakthrough in designing wider register
    renamers and making other structures wider, or both.

Pushing for narrower cores appears implausible to me at this stage.

Concerning the removal of SMT, I can only guess, but that did not
appear implausible to me with Intel's hybrid CPUs: They have P-cores
    for fast single-thread performance, and lots of E-cores for
    multi-thread performance. You allocate threads that need
    single-thread performance to P-cores and threads that don't to
    E-cores. If you have even more tasks, i.e., a heavily multi-threaded
    load, do you want to slow down the threads that run on the P-cores by
    switching them to SMT mode, also increasing the already-high power
    consumption of the P-cores, lowering the clock of everything to stay
    within the power limit, and thus possibly the performance? If not,
    you don't need SMT.

    Still, after touting the SMT horn for so long, I don't expect that
    such considerations are the only ones. There must be a significant
    advantage in design complexity or die area when leaving it away
    (contradicting the earlier claim that SMT costs very little).

Concerning super cores, whatever they are, my guess is that the idea is
    to try to extract even more performance from (as far as software is
    concerned) single-threaded programs than achievable with the wide
    cores of today.

    Anyway, couple of months ago Tan himself said that Intel is reversing
    the decision to remove SMT.

    On the servers, they do not follow the hybrid strategy, for whatever
    reason, so the thoughts above don't apply there. And maybe they found
    that the cloud providers want SMT, in order to sell their customers
    twice as many "CPUs".

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@[email protected] to comp.arch on Wed Sep 17 13:07:49 2025
    From Newsgroup: comp.arch

    On 9/15/2025 6:54 PM, John Savard wrote:
    When I saw a post about a new way to do OoO, I had thought it might be talking about this:

https://www.techradar.com/pro/is-it-a-bird-is-it-a-plane-no-its-super-core-intels-latest-patent-revives-ancient-anti-hyperthreading-cpu-technique-in-attempt-to-boost-processor-performance-but-will-it-be-enough

    Basically, Intel proposes to boost single-thread performance by splitting programs into chunks that can be performed in parallel on different cores, where the cores are intimately connected in order to make this work.

    This is a sound idea, but one may not find enough opportunities to use it.

    Although it's called "inverse hyperthreading", this technique could be combined with SMT - put the chunks into different threads on the same
    core, rather than on different cores, and then one wouldn't need to add
    extra connections between cores to make it work.


    Still sometimes it seems like it is only a matter of time until Intel or
    AMD releases a new CPU that just sort of jettisons x86 entirely at the hardware level, but then pretends to still be an x86 chip by running *everything* in a firmware level emulator via dynamic translation.

    Say, more cores and less power use, at the possible expense of some
    amount of performance.

    ...


    John Savard

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@[email protected] to comp.arch on Wed Sep 17 18:53:24 2025
    From Newsgroup: comp.arch

    BGB <[email protected]> schrieb:

    Still sometimes it seems like it is only a matter of time until Intel or
    AMD releases a new CPU that just sort of jettisons x86 entirely at the hardware level, but then pretends to still be an x86 chip by running *everything* in a firmware level emulator via dynamic translation.

For AMD, that happened already a few decades ago; they translate
    x86 code into RISC-like microops.

    See "The Anatomy of a High-Performance Microprocessor: A Systems
    Perspective" by Bruce Shriver and Bennett Smith.

    For a later perspective, see

    https://github.com/google/security-research/blob/master/pocs/cpus/entrysign/zentool/docs/reference.md
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Wed Sep 17 18:54:01 2025
    From Newsgroup: comp.arch


    BGB <[email protected]> posted:

    On 9/15/2025 6:54 PM, John Savard wrote:
    When I saw a post about a new way to do OoO, I had thought it might be talking about this:

https://www.techradar.com/pro/is-it-a-bird-is-it-a-plane-no-its-super-core-intels-latest-patent-revives-ancient-anti-hyperthreading-cpu-technique-in-attempt-to-boost-processor-performance-but-will-it-be-enough

    Basically, Intel proposes to boost single-thread performance by splitting programs into chunks that can be performed in parallel on different cores, where the cores are intimately connected in order to make this work.

    This is a sound idea, but one may not find enough opportunities to use it.

    Although it's called "inverse hyperthreading", this technique could be combined with SMT - put the chunks into different threads on the same
    core, rather than on different cores, and then one wouldn't need to add extra connections between cores to make it work.


    Still sometimes it seems like it is only a matter of time until Intel or
    AMD releases a new CPU that just sort of jettisons x86 entirely at the hardware level, but then pretends to still be an x86 chip by running *everything* in a firmware level emulator via dynamic translation.

    Transmeta tried and failed to do this.

    Say, more cores and less power use, at the possible expense of some
    amount of performance.

    ...


    John Savard

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@[email protected] to comp.arch on Wed Sep 17 23:00:15 2025
    From Newsgroup: comp.arch

    On Wed, 17 Sep 2025 18:53:24 -0000 (UTC)
    Thomas Koenig <[email protected]> wrote:

    BGB <[email protected]> schrieb:

    Still sometimes it seems like it is only a matter of time until
    Intel or AMD releases a new CPU that just sort of jettisons x86
    entirely at the hardware level, but then pretends to still be an
    x86 chip by running *everything* in a firmware level emulator via
    dynamic translation.

    For AMD, that has happend already a few decades ago; they translate
    x86 code into RISC-like microops.


    Not really.

    First, translation on the fly does not count.

Second, even for translation on the fly, only the ancient K6 worked that
way. Their later chips did a lot of work at the level of macro-ops,
which in the majority of cases have a one-to-one correspondence to the
original x86 load-op and load-op-store instructions.

    Actually, I am not 100% sure about Bulldozer and derivatives, but K7,
    K8 and all generations of Zen are using macro-ops.

    See "The Anatomy of a High-Performance Microprocessor: A Systems
    Perspective" by Bruce Shriver and Bennett Smith.


    Badly outdated text.

    For a later perspective, see

    https://github.com/google/security-research/blob/master/pocs/cpus/entrysign/zentool/docs/reference.md




    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@[email protected] (Scott Lurndal) to comp.arch on Wed Sep 17 20:19:14 2025
    From Newsgroup: comp.arch

    BGB <[email protected]> writes:
    On 9/15/2025 6:54 PM, John Savard wrote:
    When I saw a post about a new way to do OoO, I had thought it might be
    talking about this:

https://www.techradar.com/pro/is-it-a-bird-is-it-a-plane-no-its-super-core-intels-latest-patent-revives-ancient-anti-hyperthreading-cpu-technique-in-attempt-to-boost-processor-performance-but-will-it-be-enough

    Basically, Intel proposes to boost single-thread performance by splitting
programs into chunks that can be performed in parallel on different cores, where the cores are intimately connected in order to make this work.

This is a sound idea, but one may not find enough opportunities to use it.
    Although it's called "inverse hyperthreading", this technique could be
    combined with SMT - put the chunks into different threads on the same
    core, rather than on different cores, and then one wouldn't need to add
    extra connections between cores to make it work.


    Still sometimes it seems like it is only a matter of time until Intel or
AMD releases a new CPU that just sort of jettisons x86 entirely at the hardware level, but then pretends to still be an x86 chip by running *everything* in a firmware level emulator via dynamic translation.

    That was tried three decades ago. https://en.wikipedia.org/wiki/Transmeta


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Levine@[email protected] to comp.arch on Wed Sep 17 21:33:17 2025
    From Newsgroup: comp.arch

    According to BGB <[email protected]>:
    Still sometimes it seems like it is only a matter of time until Intel or
AMD releases a new CPU that just sort of jettisons x86 entirely at the hardware level, but then pretends to still be an x86 chip by running *everything* in a firmware level emulator via dynamic translation.

    That sounds a whole lot like what Transmeta did 25 years ago:

    https://en.wikipedia.org/wiki/Transmeta_Crusoe

    They failed but perhaps things are different now. Their
    native architecture was VLIW which might have been part
    of the problem.
    --
    Regards,
    John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@[email protected] (Anton Ertl) to comp.arch on Thu Sep 18 05:27:15 2025
    From Newsgroup: comp.arch

    BGB <[email protected]> writes:
    Still sometimes it seems like it is only a matter of time until Intel or
AMD releases a new CPU that just sort of jettisons x86 entirely at the hardware level, but then pretends to still be an x86 chip by running *everything* in a firmware level emulator via dynamic translation.

    Intel has already done so, although AFAIK not at the firmware level:
    Every IA-64 CPU starting with the Itanium II did not implement IA-32
    in hardware (unlike the Itanium), but instead used dynamic translation.

    There is no reason for Intel to repeat this mistake, or for anyone
    else to go there, either.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@[email protected] (Anton Ertl) to comp.arch on Thu Sep 18 05:31:29 2025
    From Newsgroup: comp.arch

    Thomas Koenig <[email protected]> writes:
    BGB <[email protected]> schrieb:

    Still sometimes it seems like it is only a matter of time until Intel or
    AMD releases a new CPU that just sort of jettisons x86 entirely at the
    hardware level, but then pretends to still be an x86 chip by running
    *everything* in a firmware level emulator via dynamic translation.

    For AMD, that has happend already a few decades ago; they translate
    x86 code into RISC-like microops.

That's nonsense; regulars of this group should know better; at least
    this nonsense has been corrected often enough. E.g., I wrote in <[email protected]>:

    |Not even if the microcode the Intel and AMD chips used was really
    |RISC-like, which it was not (IIRC the P6 uses micro-instructions with
|around 100bits, and the K7 has a read-write Rop (with the "R" of "Rop"
|standing for "RISC").

    Another difference is that the OoO engine that sees the uOps performs
    only a very small part of the functionality of branches, with the
    majority performed by the front end. I.e., there is no branching in
the OoO engine that sees the uOps; at most it confirms the branch prediction or diagnoses a misprediction, at which point the OoO
    engine is out of a job and has to wait for the front end; possibly
    only the ROB (which deals with instructions again) resolves the
    misprediction and kicks the front end into action, however.

    As Mitch Alsup has written, AMD has its MacroOps (load-op and RMW) in
    addition to the Rops. It's not entirely clear which parts of the
    engine see MacroOps and ROPs, but my impression was that the MacroOps
    are not split into ROPs for the largest part of the OoO engine.

    See "The Anatomy of a High-Performance Microprocessor: A Systems
    Perspective" by Bruce Shriver and Bennett Smith.

    From 1998. Unfortunately, there are not many more recent books about
    the microarchitecture of OoO CPUs. What I have found:

    Modern Processor Design: Fundamentals of Superscalar Processors
    John Paul Shen, Mikko H. Lipasti
    McGraw-Hill
    656 pages
published 2004 or so (don't let the 2013 date from the reprint fool you)
Discusses CPU design (not just OoO) using various real CPUs from the
1990s as examples.

    Processor Microarchitecture -- An Implementation Perspective
Antonio Gonzalez, Fernando Latorre, Grigorios Magklis
    Springer
    published 2010
    Relatively short, discusses the various parts of an OoO CPU and how to implement them.

    Henry Wong
    A Superscalar Out-of-Order x86 Soft Processor for FPGA
    Ph.D. thesis, U. Toronto
    https://www.stuffedcow.net/files/henry-thesis-phd.pdf
Slides: https://web.stanford.edu/class/ee380/Abstracts/190605-slides.pdf
published 2017

    A problem is that the older books don't cover recent developments such
    as alias prediction and that Wong was limited by what a single person
    can do (his work was not part of a larger research project at
    U. Toronto), as well as what fits into an FPGA.

    BTW, Wong's work can be seen as a refutation of BGB's statement: He
    chose to implement IA-32; on slide 14 of <https://web.stanford.edu/class/ee380/Abstracts/190605-slides.pdf> he
    states "It’s easy to implement!".

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@[email protected] (Anton Ertl) to comp.arch on Thu Sep 18 06:14:30 2025
    From Newsgroup: comp.arch

John Levine <[email protected]> writes:
https://en.wikipedia.org/wiki/Transmeta_Crusoe

    They failed but perhaps things are different now. Their
    native architecture was VLIW which might have been part
    of the problem.

It definitely was. However, even on modern high-performance OoO cores
like Apple's M1-M4 P-cores or Qualcomm's Oryon, the performance of
dynamically translated AMD64 code is usually lower than on comparable
CPUs from Intel and AMD.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@[email protected] to comp.arch on Thu Sep 18 03:39:57 2025
    From Newsgroup: comp.arch

    On 9/17/2025 4:33 PM, John Levine wrote:
    According to BGB <[email protected]>:
    Still sometimes it seems like it is only a matter of time until Intel or
    AMD releases a new CPU that just sort of jettisons x86 entirely at the
    hardware level, but then pretends to still be an x86 chip by running
    *everything* in a firmware level emulator via dynamic translation.

    That sounds a whole lot like what Transmeta did 25 years ago:

    https://en.wikipedia.org/wiki/Transmeta_Crusoe

    They failed but perhaps things are different now. Their
    native architecture was VLIW which might have been part
    of the problem.


    Might be different now:
    25 years ago, Moore's law was still going strong, and the general
    concern was more about maximizing scalar performance rather than energy efficiency or core count (and, in those days, processors were generally single-core).


    Now we have a different situation:
    Moore's law is dying off;
    Scalar CPU performance has hit a plateau;
    And, for many uses, performance is "good enough";
    A lot more software can make use of multi-threading;
    ...


Likewise, x86 tends to need a lot of the "big CPU" stuff to perform
well, whereas something like a RISC-style ISA can get better performance
on a comparatively smaller and cheaper core, and with a somewhat better
"performance per watt" metric.


    So, one possibility could be, rather than a small number of big/fast
    cores (either VLIW or OoO), possibly a larger number of smaller cores.

    The cores could maybe be LIW or in-order RISC.




    One possibility could be that virtual processors don't run on a single
    core, say:
    The logical cores exist more as VMs each running a virtual x86 processor
    core;
    The dynamic translation doesn't JIT translate to a linear program.

    Say:
    Breaks code into traces;
    Each trace uses something akin to CSP mixed with Pi-Calculus;
    Address translation is explicit in the ISA, with specialized ISA level memory-ordering and control-flow primitives.

    For example, there could be special ISA level mechanisms for submitting
    a job to a local job-queue, and pulling a job from the queue.
    Memory accesses could use a special "perform a memory access or branch-subroutine" instruction ("MEMorBSR"), where the MEMorBSR
    operations will try to access memory, either continuing to the next instruction (success) or Branching-to-Subroutine (access failed).

    Where the failure cases could include (but not limited to) TLB miss;
    access fault; memory ordering fault; ...

    The "memory ordering fault" case could be, when traces are submitted to
    the queue, if they access memory, they are assigned sequence numbers
    based on Load and Store operations. When memory is accessed, the memory
blocks in the cache could be marked with sequence numbers when read or modified. On access, it could detect if/when memory accesses have
    out-of-order sequence numbers, and then fall back to special-case
    handling to restore the intended order (reverting any "uncommitted"
    writes, and putting the offending blocks back into the queue to be
    re-run after the preceding blocks have finished).
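
A rough pseudo-C sketch of that ordering check (all names invented here,
purely to illustrate the scheme described above, not an actual design):

#include <stdbool.h>
#include <stdint.h>

/* Per-cache-line metadata recording the sequence number of the last
 * trace that read or wrote the line.  A trace that conflicts with a
 * logically later trace has read or produced stale data and must be
 * rolled back and re-queued, as described above. */
struct line_meta {
    uint64_t last_read_seq;
    uint64_t last_write_seq;
};

static bool access_in_order(struct line_meta *m, uint64_t trace_seq,
                            bool is_store)
{
    if (m->last_write_seq > trace_seq)              /* later trace already wrote */
        return false;
    if (is_store && m->last_read_seq > trace_seq)   /* later trace read stale data */
        return false;

    if (is_store)
        m->last_write_seq = trace_seq;
    else if (trace_seq > m->last_read_seq)
        m->last_read_seq = trace_seq;
    return true;   /* false => revert uncommitted writes, re-queue trace */
}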

    Possibly, the caches wouldn't directly commit stores to memory, but
    instead could keep track of a group of cache lines as an "in-flight" transaction. In this case, it could be possible for a "logically older"
    block to see the memory as it was before a more recent transaction, but
    an out-of-order write could be detected via sequence numbers (if seen,
    it would mean a "future" block had run but had essentially read stale data).

    Once a block is fully committed (after all preceding blocks are
    finished) its contents can be written back out to main RAM.
    Could be held in an area of RAM local to the group of cores running the logical core.

    Possibly, such a core might actually operate in multiple address spaces:
    Virtual Memory, via the transaction oriented MEMorBSR mechanism;
    There would likely be an explicit TLB here.
    So, TLB Miss handling could be essentially a runtime call.
    Local Memory:
    Physical Address, small non-externally-visible SRAM;
    Divided into Core-Local and Group-Shared areas;
    Physical Memory:
    External DRAM or similar;
    Resembles more traditional RAM access (via Load/Store Ops);
    Could be used for VM tasks and page-table walks.


    Would likely require significant hardware level support for things like job-queues and synchronization mechanisms.

    One possibility could be that some devices could exist local to a group
    of cores, which then have a synchronous "first come, first serve" access pattern (possibly similar to how my existing core design manages MMIO).

    Possibly it could work by passing fixed-size messages over a bus, with
    each request/response pair to a device being synchronous.


Possibly the JIT could try to infer possible memory aliasing between
traces, and enforce sequential ordering if aliasing is likely. This is
because performing the operations in the correct order the first time is
likely to be cheaper than detecting an ordering violation and rolling
back a transaction.

    Whereas proving that traces can't alias is likely to be a much harder
    problem than inferring a probable absence of aliasing. If no order
    violations occur during operation, it can be safely assumed that no
    memory aliasing happened.

    Maintaining transactions would complicate the cache design though, since
    now there is a problem that the cache line can't be written back or
    evicted until its write-associated sequence is fully committed.

    Might also need to be separate queue spots for "tasks currently being
    worked on" vs "to be done after the current jobs are done". Say, for
    example, if a job needs to be rolled-back and re-run, it would still
    need to come before jobs that are further in the future relative to itself.

    Unlike memory, register ordering is easier to infer statically, at least
    in the absence of dynamic branching.

    Might need to enforce ordering in cases where:
    Dynamic branch occurs and the path can't be followed statically;
    A following trace would depend on a register modified in a preceding trace;
    ...



    As for how viable any of this is, I don't know...

    The VM could be a lot simpler if one assumes a single threaded VM.


    Also unclear is if an ISA could be designed in a way to keep overheads
    low enough (would be a waste if the multi-threaded VM is slower than a
    single threaded VM would have been). But, this would require a lot of
    exotic mechanisms, so dunno...

    ...


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@[email protected] to comp.arch on Thu Sep 18 03:58:16 2025
    From Newsgroup: comp.arch

    On 9/18/2025 1:14 AM, Anton Ertl wrote:
    John Levine <[email protected]> writes:
    https://en.wikipedia.org/wiki/Transmeta_Crusoe

    They failed but perhaps things are different now. Their
    native architecture was VLIW which might have been part
    of the problem.

    It definitely was. However, even a modern high-performance OoO cores
    like Apple M1-M4's P-cores or on Qualcomm's Oryon, the performance of dynamically-translated AMD64 code is usually slower than on comparable
    CPUs from Intel and AMD.


    But, AFAIK the ARM cores tend to use significantly less power when
    emulating x86 than a typical Intel or AMD CPU, even if slower.

    Like, most of the ARM chips don't exactly have like a 150W TDP or similar...

    Like, if an ARM chip uses 1/30th the power, unless it is more than 30x
    slower, it may still win in Perf/W and similar...

    Then there is also Perf/$, and if such a CPU can win in both Perf/W and Perf/$, then it can still win even if it is slower, by throwing more
    cores at the problem.
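
Spelling out that arithmetic (the 1/30 power ratio and the 10x slowdown
below are made-up illustrative numbers, not measurements):

#include <stdio.h>

/* Throwaway arithmetic for the point above: with 1/30th the power, an
 * emulating chip wins on perf/W as long as it is less than 30x slower. */
int main(void)
{
    double power_ratio = 1.0 / 30.0; /* emulating chip power / native power */
    double slowdown    = 10.0;       /* emulating chip is 10x slower (toy value) */

    double perf_per_watt_gain = (1.0 / slowdown) / power_ratio;
    printf("perf/W relative to native: %.1fx\n", perf_per_watt_gain);
    /* prints 3.0x - still ahead on perf/W, since 10 < 30 */
    return 0;
}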


    Though, the possibly interesting idea could be trying for a
    multi-threaded translation rather than a single threaded translation.
    But, to have any hope, a multi-threaded translation is likely to need
    exotic ISA features; whereas a single threaded VM could probably run
    mostly OK on normal ARM or RISC-V or similar (well, assuming a world
where RISC-V addresses some more of its weak areas; but then again, with recent proposals for indexed load/store and auto-increment popping up,
    this is starting to look more likely...).


    - anton

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@[email protected] to comp.arch on Thu Sep 18 17:51:36 2025
    From Newsgroup: comp.arch

    On Thu, 18 Sep 2025 05:27:15 GMT
    [email protected] (Anton Ertl) wrote:

    BGB <[email protected]> writes:
    Still sometimes it seems like it is only a matter of time until
    Intel or AMD releases a new CPU that just sort of jettisons x86
    entirely at the hardware level, but then pretends to still be an x86
    chip by running *everything* in a firmware level emulator via
    dynamic translation.

    Intel has already done so, although AFAIK not at the firmware level:
    Every IA-64 CPU starting with the Itanium II did not implement IA-32
    in hardware (unlike the Itanium), but instead used dynamic
    translation.


    That's imprecise.
The first couple of generations of Itanium 2 (McKinley, Madison) still
had IA-32 hardware. It was gone in Montecito (2006).
Dynamic translation of application code was indeed available much
earlier, but early removal of the [crappy] hardware solution was
probably considered too risky.



    There is no reason for Intel to repeat this mistake, or for anyone
    else to go there, either.

    - anton

As said by just about everybody, BGB's proposal is most similar
to Transmeta. What was not said by everybody is that a similar approach
was tried for Arm, by NVidia no less.
https://en.wikipedia.org/wiki/Project_Denver



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Thu Sep 18 16:16:54 2025
    From Newsgroup: comp.arch


    Thomas Koenig <[email protected]> posted:

    BGB <[email protected]> schrieb:

    Still sometimes it seems like it is only a matter of time until Intel or AMD releases a new CPU that just sort of jettisons x86 entirely at the hardware level, but then pretends to still be an x86 chip by running *everything* in a firmware level emulator via dynamic translation.

    For AMD, that has happend already a few decades ago; they translate
    x86 code into RISC-like microops.

    With a very loose definition of RISC::

a) Does a RISC ISA contain memory reference address generation from
    the pattern [Rbase+Rindex<<scale+Displacement] ??
    Some will argue yes, others no.

    b) does a RISC ISA contain memory reference instructions that are
    combined with arithmetic calculations ??
    Some will argue yes, others no.

    c) does a RISC ISA contain memory reference instructions that
    access memory twice ??? LD-OP-ST :: but the TLB only once ?!?
    Most would argue no.

    Yet, this is the µISA of K7 and K8. It is only RISC in the very
    loosest sense of the word.
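
For point (a), a concrete illustration (my example, nothing from the
K7/K8 documentation): the C access below is a single scaled-index load
on x86-64, while a classic load/store RISC spells the address
arithmetic out.

/* Illustrates the [Rbase + Rindex<<scale + Displacement] pattern of
 * point (a). */
long get_elem(const long *base, long idx)
{
    return base[idx + 2];   /* effective address = base + idx*8 + 16 */
}
/* x86-64 (one instruction):   mov rax, [rdi + rsi*8 + 16]
 * RV64I load/store style:     slli t0, a1, 3
 *                             add  t0, a0, t0
 *                             ld   a0, 16(t0)
 */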

    And do not get me started on the trap/exception/interrupt model.


    See "The Anatomy of a High-Performance Microprocessor: A Systems
    Perspective" by Bruce Shriver and Bennett Smith.

    For a later perspective, see

    https://github.com/google/security-research/blob/master/pocs/cpus/entrysign/zentool/docs/reference.md


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@[email protected] to comp.arch on Thu Sep 18 12:33:44 2025
    From Newsgroup: comp.arch

    Anton Ertl wrote:
    Thomas Koenig <[email protected]> writes:
    BGB <[email protected]> schrieb:

Still sometimes it seems like it is only a matter of time until Intel or AMD releases a new CPU that just sort of jettisons x86 entirely at the
    hardware level, but then pretends to still be an x86 chip by running
    *everything* in a firmware level emulator via dynamic translation.
    For AMD, that has happend already a few decades ago; they translate
    x86 code into RISC-like microops.

    That's nonsense; regulars of this groups should know better, at least
    this nonsense has been corrected often enough. E.g., I wrote in <[email protected]>:

|Not even if the microcode the Intel and AMD chips used was really
|RISC-like, which it was not (IIRC the P6 uses micro-instructions with
|around 100bits, and the K7 has a read-write Rop (with the "R" of "Rop"
|standing for "RISC").

    I don't know what you are objecting to - Intel calls its internal
    instructions micro-operations or uOps, and AMD calls its Rops.
    The term is widely used to mean something that executes internally.
Beyond that it depends on the specifics of each micro-architecture.

    The number of bits has nothing to do with what it is called.
    If this uOp was for a ROB style design where all the knowledge about
    each instruction including register ids, immediate data,
    scheduling info, result data, status, is stored in a single ROB entry,
    then 100 bits sounds pretty small so I'm guessing that was a 32-bit cpu.

    Another difference is that the OoO engine that sees the uOps performs
    only a very small part of the functionality of branches, with the
    majority performed by the front end. I.e., there is no branching in
    the OoO engine that sees the uOps, at the most it confirms the branch prediction, or diagnoses a misprediction, at which point the OoO
    engine is out of a job and has to wait for the front end; possibly
    only the ROB (which deals with instructions again) resolves the
    misprediction and kicks the front end into action, however.

    And a uOp triggers that action sequence.
    I don't see the distinction you are trying to make.

    As Mitch Alsup has written, AMD has its MacroOps (load-op and RMW) in addition to the Rops. It's not entirely clear which parts of the
    engine see MacroOps and ROPs, but my impression was that the MacroOps
    are not split into ROPs for the largest part of the OoO engine.

AMD explains their terminology here, but note that the relationship
    between Macro-Ops and Micro-Ops is micro-architecture specific.

A Seventh-Generation x86 Microprocessor, 1999
https://www.academia.edu/download/70925991/4.79985120211001-19357-4pufup.pdf

    "An [micro-]OP is the minimum executable entity understood by the machine."
    A macro-op is a bundle of 1 to 3 micro-ops.
    Simple instructions map to 1 macro and 1-3 micro ops
    and this mapping is done in the decoder.
    Complex instructions map to one or more "micro-lines" each of which
    consists of 3 macro-ops (of 1-3 micro-ops each) pulled from micro-code ROM.
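
As a toy model of that terminology (the encoding below is invented for
illustration; it is not AMD's actual internal format): a load-op-store
instruction decodes to one macro-op bundling three micro-ops.

#include <stdio.h>

/* Toy model of the macro-op/micro-op bundling quoted above. */
enum uop_kind { UOP_LOAD, UOP_ALU, UOP_STORE };

struct micro_op { enum uop_kind kind; };

struct macro_op {
    int n;                    /* 1 to 3 micro-ops per macro-op */
    struct micro_op ops[3];
};

int main(void)
{
    /* e.g. an x86 "add [mem], reg" style load-op-store as one macro-op */
    struct macro_op m = { 3, { { UOP_LOAD }, { UOP_ALU }, { UOP_STORE } } };

    printf("one macro-op carrying %d micro-ops\n", m.n);
    return 0;
}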

    See "The Anatomy of a High-Performance Microprocessor: A Systems
    Perspective" by Bruce Shriver and Bennett Smith.

    From 1998. Unfortunately, there are not many more recent books about
    the microarchitecture of OoO CPUs. What I have found:

    Modern Processor Design: Fundamentals of Superscalar Processors
    John Paul Shen, Mikko H. Lipasti
    McGraw-Hill
    656 pages
published 2004 or so (don't let the 2013 date from the reprint fool you)
Discusses CPU design (not just OoO) using various real CPUs from the
1990s as example.

    Processor Microarchitecture -- An Implementation Perspective
    Antonio Gonzalez , Fernando Latorre , Grigorios Magklis
    Springer
    published 2010
    Relatively short, discusses the various parts of an OoO CPU and how to implement them.

    Henry Wong
    A Superscalar Out-of-Order x86 Soft Processor for FPGA
Ph.D. thesis, U. Toronto
https://www.stuffedcow.net/files/henry-thesis-phd.pdf
Slides: https://web.stanford.edu/class/ee380/Abstracts/190605-slides.pdf
published 2017

    A problem is that the older books don't cover recent developments such
    as alias prediction and that Wong was limited by what a single person
    can do (his work was not part of a larger research project at
    U. Toronto), as well as what fits into an FPGA.

    BTW, Wong's work can be seen as a refutation of BGB's statement: He
    chose to implement IA-32; on slide 14 of <https://web.stanford.edu/class/ee380/Abstracts/190605-slides.pdf> he
    states "It’s easy to implement!".

    - anton

    Other micro-architecture related sources since 2000:

    Book
    A Primer on Memory Consistency and Cache Coherence 2nd Ed, 2020
    Vijay Nagarajan, Daniel J. Sorin, Mark D. Hill, and David A. Wood

    Dissertation
    Complexity and Correctness of a Super-Pipelined Processor, 2005
Jochen Preiß

    Book
    General-Purpose Graphics Processor Architectures, 2018
    Aamodt, Wai Lun Fung, Rogers

    Book
    Microprocessor Architecture
    From Simple Pipelines to Chip Multiprocessors, 2010
    Jean-Loup Baer

    Book
    Processor Microarchitecture An Implementation Perspective, 2011
Antonio González, Fernando Latorre, and Grigorios Magklis

    This is a bit introductory level:

    Book
    Computer Organization and Design
    The Hardware/Software Interface: RISC-V Edition, 2018
    Patterson, Hennessy


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@[email protected] to comp.arch on Thu Sep 18 20:26:29 2025
    From Newsgroup: comp.arch

    On Thu, 18 Sep 2025 12:33:44 -0400
    EricP <[email protected]> wrote:

    Anton Ertl wrote:
    Thomas Koenig <[email protected]> writes:
    BGB <[email protected]> schrieb:

    Still sometimes it seems like it is only a matter of time until
    Intel or AMD releases a new CPU that just sort of jettisons x86
    entirely at the hardware level, but then pretends to still be an
    x86 chip by running *everything* in a firmware level emulator via
    dynamic translation.
    For AMD, that has happend already a few decades ago; they translate
    x86 code into RISC-like microops.

    That's nonsense; regulars of this groups should know better, at
    least this nonsense has been corrected often enough. E.g., I wrote
    in <[email protected]>:

|Not even if the microcode the Intel and AMD chips used was really
|RISC-like, which it was not (IIRC the P6 uses micro-instructions with
|around 100bits, and the K7 has a read-write Rop (with the "R" of "Rop"
|standing for "RISC").

    I don't know what you are objecting to - Intel calls its internal instructions micro-operations or uOps, and AMD calls its Rops.


No, they don't. They stopped using the term Rops almost 25 years ago.
If they used it in early K7 manuals then it was partly due to inertia (K6
manuals copy&pasted without much thought given) and partly because
of marketing, because RISC was considered cool.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@[email protected] to comp.arch on Thu Sep 18 14:42:36 2025
    From Newsgroup: comp.arch

    Michael S wrote:
    On Thu, 18 Sep 2025 12:33:44 -0400
    EricP <[email protected]> wrote:

    Anton Ertl wrote:
    Thomas Koenig <[email protected]> writes:
    BGB <[email protected]> schrieb:

    Still sometimes it seems like it is only a matter of time until
    Intel or AMD releases a new CPU that just sort of jettisons x86
    entirely at the hardware level, but then pretends to still be an
    x86 chip by running *everything* in a firmware level emulator via
    dynamic translation.
    For AMD, that has happend already a few decades ago; they translate
    x86 code into RISC-like microops.
    That's nonsense; regulars of this groups should know better, at
    least this nonsense has been corrected often enough. E.g., I wrote
    in <[email protected]>:

    |Not even if the microcode the Intel and AMD chips used was really
|RISC-like, which it was not (IIRC the P6 uses micro-instructions with
|around 100bits, and the K7 has a read-write Rop (with the "R" of "Rop"
|standing for "RISC").
    I don't know what you are objecting to - Intel calls its internal
    instructions micro-operations or uOps, and AMD calls its Rops.


    No, they don't. They stopped using term Rops almost 25 years ago.
    If they used it in early K7 manuals then it was due to inertia (K6
    manuals copy&pasted without much of thought given) and partly because
    of marketing, because RISC was considered cool.

    And the fact that all the RISC processors ran rings around the CISC ones.
    So they wanted to promote that "hey, we can go fast too!"

    Ok, AMD dropped the "risc" prefix 25 years ago.
    That didn't change the way it works internally.

    They still use the term "micro op" in the Intel and AMD Optimization guides.
It still means a micro-architecture-specific internal simple, discrete
    unit of execution, albeit a more complex one as transistor budgets allow.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@[email protected] to comp.arch on Thu Sep 18 14:05:04 2025
    From Newsgroup: comp.arch

    On 9/18/2025 11:16 AM, MitchAlsup wrote:

    Thomas Koenig <[email protected]> posted:

    BGB <[email protected]> schrieb:

Still sometimes it seems like it is only a matter of time until Intel or AMD releases a new CPU that just sort of jettisons x86 entirely at the
    hardware level, but then pretends to still be an x86 chip by running
    *everything* in a firmware level emulator via dynamic translation.

    For AMD, that has happend already a few decades ago; they translate
    x86 code into RISC-like microops.

    With a very loose definition of RISC::

    a)Does a RISC ISA contain memory reference address generation from
    the pattern [Rbase+Rindex<<scale+Displacement] ??
    Some will argue yes, others no.

    b) does a RISC ISA contain memory reference instructions that are
    combined with arithmetic calculations ??
    Some will argue yes, others no.

    c) does a RISC ISA contain memory reference instructions that
    access memory twice ??? LD-OP-ST :: but the TLB only once ?!?
    Most would argue no.

    Yet, this is the µISA of K7 and K8. It is only RISC in the very
    loosest sense of the word.

    And do not get me started on the trap/exception/interrupt model.



    Still reminds me of the LOL of some of the old marketing for the TI
    MSP430 trying to pass it off as RISC:
    In practice has variable-length instructions (via @PC+ addressing);
    Has auto-increment addressing modes and similar;
    Most instructions can operate directly on memory;
    Has ability to do Mem/Mem operations;
    ...

In effect, the MSP430 was closer to the DEC PDP-11 than to much of anything else in the RISC family.

    Even SuperH, which also branched off from similar origins, had gone over
    to purely 16-bit instructions, and was Load/Store, so more deserving of
    the RISC title (though apparently still a lot more PDP-11 flavored than
    MIPS flavored).


Their rationale: "But our instruction listing isn't very long, so RISC",
never mind all of the edge cases they hid away in the various addressing
modes and register combinations.

    But, yeah, following similar logic to what TI was using, one could look
    at something like the Motorola 68000 and be all like, "Yep, looks like
    RISC to me"...


    ...




    See "The Anatomy of a High-Performance Microprocessor: A Systems
    Perspective" by Bruce Shriver and Bennett Smith.

    For a later perspective, see

    https://github.com/google/security-research/blob/master/pocs/cpus/entrysign/zentool/docs/reference.md



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@[email protected] to comp.arch on Thu Sep 18 22:56:22 2025
    From Newsgroup: comp.arch

    On Thu, 18 Sep 2025 14:42:36 -0400
    EricP <[email protected]> wrote:

    Michael S wrote:
    On Thu, 18 Sep 2025 12:33:44 -0400
    EricP <[email protected]> wrote:

    Anton Ertl wrote:
    Thomas Koenig <[email protected]> writes:
    BGB <[email protected]> schrieb:

    Still sometimes it seems like it is only a matter of time until
    Intel or AMD releases a new CPU that just sort of jettisons x86
    entirely at the hardware level, but then pretends to still be an
    x86 chip by running *everything* in a firmware level emulator
    via dynamic translation.
    For AMD, that has happend already a few decades ago; they
    translate x86 code into RISC-like microops.
    That's nonsense; regulars of this groups should know better, at
    least this nonsense has been corrected often enough. E.g., I
    wrote in <[email protected]>:

    |Not even if the microcode the Intel and AMD chips used was really
|RISC-like, which it was not (IIRC the P6 uses micro-instructions with
|around 100bits, and the K7 has a read-write Rop (with the "R" of "Rop"
|standing for "RISC").
    I don't know what you are objecting to - Intel calls its internal
    instructions micro-operations or uOps, and AMD calls its Rops.


    No, they don't. They stopped using term Rops almost 25 years ago.
    If they used it in early K7 manuals then it was due to inertia (K6
    manuals copy&pasted without much of thought given) and partly
    because of marketing, because RISC was considered cool.

    And the fact that all the RISC processors ran rings around the CISC
    ones.

    In 1988. In 1998 - much less so.

    So they wanted to promote that "hey, we can go fast too!"

    Ok, AMD dropped the "risc" prefix 25 years ago.
    That didn't change the way it works internally.


Of course they did. Several times.
Even Zen3 works non-trivially differently from Zen1 and 2.
If you stopped following in the previous millennium, it's your problem
rather than theirs.

    They still use the term "micro op" in the Intel and AMD Optimization
    guides. It still means a microarchitecture-specific, simple, discrete
    internal unit of execution, albeit a more complex one as
    transistor budgets allow.


    By that logic every CISC is RISC, because at some internal level they
    execute simple operations. Even those with a load-ALU pipeline do the load
    and the ALU operation at separate stages.







    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@[email protected] (Anton Ertl) to comp.arch on Fri Sep 19 09:50:32 2025
    From Newsgroup: comp.arch

    BGB <[email protected]> writes:
    On 9/17/2025 4:33 PM, John Levine wrote:
    According to BGB <[email protected]>:
    Still sometimes it seems like it is only a matter of time until Intel or AMD releases a new CPU that just sort of jettisons x86 entirely at the
    hardware level, but then pretends to still be an x86 chip by running
    *everything* in a firmware level emulator via dynamic translation.

    That sounds a whole lot like what Transmeta did 25 years ago:

    https://en.wikipedia.org/wiki/Transmeta_Crusoe

    They failed but perhaps things are different now. Their
    native architecture was VLIW which might have been part
    of the problem.


    Might be different now:
    25 years ago, Moore's law was still going strong, and the general
    concern was more about maximizing scalar performance rather than energy efficiency or core count (and, in those days, processors were generally single-core).

    IA-64 CPUs were shipped until July 29, 2021, and Poulson (released
    2012) has 8 cores. If IA-64 (and dynamically translating AMD64 to it)
    would be a good idea nowadays, it would not have been canceled.

    How should the number of cores change anything? If you cannot make single-threaded IA-32 or AMD64 programs run at competitive speeds on
    IA-64 hardware, how would that inefficiency be eliminated in
    multi-threaded programs?

    Now we have a different situation:
    Moore's law is dying off;

    Even if that is the case, how should that change anything about the
    relative merits of the two approaches?

    Scalar CPU performance has hit a plateau;

    True, but again, what's the relevance for the discussion at hand?

    And, for many uses, performance is "good enough";

    In that case, better buy a cheaper AMD64 CPU rather than a
    particularly fast CPU with a different architecture X and then run a
    dynamic AMD64->X translator on it.

    A lot more software can make use of multi-threading;

    Possible, but how would it change things?

    Likewise, x86 tends to need a lot of the "big CPU" stuff to perform
    well, whereas something like a RISC style ISA can get better performance
    on a comparably smaller and cheaper core, and with a somewhat better "performance per watt" metric.

    Evidence?

    Yes, you can run CPUs with Intel P-cores and AMD's non-compact cores
    with higher power limits than what the Apple and Qualcomm chips
    approximately consume (I have not seen proper power consumption
    numbers for these since Anandtech stopped publishing), but you can
    also run Intel CPUs and AMD CPUs at low power limits, with much better "performance per watt". It's just that many buyers of these CPUs care
    about performance, not performance per watt.

    And if you run AMD64 software on your binary translator on CPUs with
    e.g., ARM A64 architecture, the performance per watt is worse than
    when running it on an AMD64 CPU.

    So, one possibility could be, rather than a small number of big/fast
    cores (either VLIW or OoO), possibly a larger number of smaller cores.

    The cores could maybe be LIW or in-order RISC.

    The approach of a large number of small, slow cores has been tried,
    e.g., in the TILE64, but has not been successful with that core size.
    Other examples are Sun's UltraSparc T1000 and followons, which were
    somewhat more successful, but eventually led to the cancellation of
    SPARC.

    Finally, Intel now offers E-core-only chips for clients (e.g., N100)
    and servers (Sierra Forest), but they have not stopped releasing
    P-Core-only server CPUs. For the desktop the CPU with the largest
    number of E-Cores (16) also has 8 P-cores, so Intel obviously
    believes that not all desktop applications are embarrassingly
    parallel.

    Intel used to have Xeon Phi CPUs with a higher number of narrower
    cores, but eventually replaced them with Xeon processors that have
    fewer, but more powerful cores.

    AMD offers compact-core-only server CPUs with more cores and less
    cache per core, but otherwise the same microarchitecture, only with a
    much lower clock ceiling. (There is a difference in microarchitecture
    wrt executing AVX-512 instructions on Zen5, but that's minor). AMD
    also offers server CPUs with non-compact cores; interestingly, if we
    compare CPUs with the same numbers of cores, the launch price (at the
    same date) is not that far apart:

    Model      cores  base   boost  cache  TDP   launch    current
                      (GHz)  (GHz)
    EPYC 9755  128    2.7    4.1    512MB  500W  USD12984  EUR5979
    EPYC 9745  128    2.3    3.7    256MB  400W  USD12141  EUR4192

    Current pricing from <https://geizhals.eu/?cat=cpuamdam4&xf=12099_Server~25_128~596_Turin~596_Turin+Dense>;
    however, the third-cheapest dealer for the 9745 asks for EUR 6129, and
    the cheapest price up to 2025-09-10 has been EUR 6149, so the current
    price difference may be short-lived. The cheapest price for the 9755
    was EUR 4461 on 2025-08-25, and at that time the 9755 was cheaper than the
    9745 (at least as far as the prices seen by the website above are
    concerned).

    I have thought about why the idea of more smaller cores has not been
    more successful, at least for the kinds of loads where you have a
    large number of independent and individually not particularly
    demanding threads, as in web shops. My explanation is that you need
    1) memory bandwidth and 2) interconnection with the rest of the
    system.

    The interconnection with the rest of the system probably does
    not get much cheaper for the smaller cores, and probably becomes more
    expensive with more cores (e.g., Intel switched from a ring to a grid
    when they increased the cores in their server chips).

    The bandwidth requirements to main memory for given cache sizes per
    core reduce linearly with the performance of the cores; if the larger
    number of smaller cores really leads to increased aggregate
    performance, additional main memory bandwidth is needed, or you can
    compensate for that with larger caches.
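
    To put made-up numbers on that: if a big core delivering performance 1.0
    generates 10 GB/s of DRAM traffic with its caches, a half-performance core
    with the same cache sizes generates roughly 5 GB/s, but you need more of
    them for the same or higher aggregate performance:

    # Toy model, invented numbers: per-core DRAM traffic scales with per-core
    # performance when the cache size per core is held constant.
    big_perf, big_bw = 1.0, 10.0         # performance units, GB/s per core
    small_perf, small_bw = 0.5, 5.0      # half the performance, half the traffic

    n_big, n_small = 32, 80              # 80 small cores to beat 32 big cores
    print(n_big * big_perf, n_big * big_bw)          # 32.0 perf, 320 GB/s
    print(n_small * small_perf, n_small * small_bw)  # 40.0 perf, 400 GB/s
    # Higher aggregate performance from the small cores also means higher
    # aggregate DRAM bandwidth demand, unless the caches per core grow.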

    But to eliminate some variables, let's just consider the case where we
    want to get the same performance with the same main memory bandwidth
    from using more smaller cores than we use now. Will the resulting CPU
    require less area? The cache sizes per core are not reduced, and
    their area is not reduced much. The core itself will get smaller, and
    its performance will also get smaller (although by less than the
    core). But if you sum up the total per-core area (core, caches, and interconnect), at some point the per-core area reduces by less than
    the per-core performance, so for a given amount of total performance,
    the area goes up.
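
    A toy area model (all numbers invented) illustrates where that bites: the
    cache and interconnect area per core stay roughly fixed, so shrinking the
    core eventually stops paying off in area per unit of performance:

    fixed = 4.0                          # per-core caches + interconnect (area units)
    designs = [                          # (name, core area, core performance), invented
        ("big",    8.0, 1.00),
        ("medium", 4.0, 0.70),
        ("small",  2.0, 0.45),
        ("tiny",   1.0, 0.25),
    ]
    for name, core_area, perf in designs:
        print(name, (core_area + fixed) / perf)   # area per unit of performance
    # big 12.0, medium ~11.4, small ~13.3, tiny 20.0: once the core is small
    # relative to its fixed surroundings, total area for a given total
    # performance starts to rise again.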

    There is one counterargument to these considerations: The largest
    configuration of Turin dense has less cache for more cores than the
    largest configuration of Turin. I expect that's the reason why they
    offer both; if you have less memory-intensive loads, Turin dense with
    the additional cores will give you more performance, otherwise you
    better buy Turin.

    Also, Intel has added 16 E-Cores to their desktop chips without giving
    them the same amount of caches as the P-Cores; e.g., in Arrow lake we
    have

    P-core:  48KB D-L0, 64KB I-L1, 192KB D-L1, 3MB L2,        3MB L3/core
    E-core:             64KB I-L1,  32KB D-L1, 4MB L2/4cores, 3MB L3/4cores
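
    For scale, the cache per core in that table works out roughly as follows
    (treating the shared E-core L2/L3 as split evenly across a 4-core cluster):

    KB, MB = 1, 1024
    p_core = 48*KB + 64*KB + 192*KB + 3*MB + 3*MB     # ~6448 KB (~6.3 MB) per P-core
    e_core = 32*KB + 64*KB + 4*MB/4 + 3*MB/4          # ~1888 KB (~1.8 MB) per E-core
    print(p_core / MB, e_core / MB, p_core / e_core)  # ~6.3, ~1.8, ~3.4x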

    Here we don't have an alternative with more P-Cores and the same
    bandwidth, so we cannot contrast the approaches. But it's certainly
    the case that if you have a bandwidth-hungry load, you don't need to
    buy the Arrow Lake with the largest number of E-Cores.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@[email protected] (Anton Ertl) to comp.arch on Fri Sep 19 14:33:44 2025
    From Newsgroup: comp.arch

    BGB <[email protected]> writes:
    Like, most of the ARM chips don't exactly have like a 150W TDP or similar...

    And most Intel and AMD chips don't have a 150W TDP either, although the
    shenanigans they play with TDP are not nice. The usual TDP for
    Desktop chips is 65W (with the power limits temporarily or permanently
    higher). The Zen5 laptop chips (Strix Point, Krackan Point) have a configurable TDP of 15-54W. Lunar Lake (4 P-cores, 4 LP-E-cores) has
    a configurable TDP of 8-37W.

    Like, if an ARM chip uses 1/30th the power, unless it is more than 30x slower, it may still win in Perf/W and similar...

    No TDP numbers are given for Oryon. For Apple's M4, the numbers are

    M4 4P 6E 22W
    M4 Pro 8P 4E 38W
    M4 Pro 10P 4E 46W
    M4 Max 10P 4E 62W
    M4 Max 12P 4E 70W

    Not quite 1/30th of the power, although I think that Apple does not
    play the same shenanigans as Intel and AMD.

    [RISC-V]
    recent proposals for indexed load/store and auto-increment popping up,

    Where can I read about that?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@[email protected] to comp.arch on Fri Sep 19 18:12:38 2025
    From Newsgroup: comp.arch

    On Fri, 19 Sep 2025 09:50:32 GMT
    [email protected] (Anton Ertl) wrote:


    I have thought about why the idea of more smaller cores has not been
    more successful, at least for the kinds of loads where you have a
    large number of independent and individually not particularly
    demanding threads, as in web shops. My explanation is that you need
    1) memory bandwidth and 2) interconnection with the rest of the
    system.

    The interconnection with the rest of the system probably does
    not get much cheaper for the smaller cores, and probably becomes more expensive with more cores (e.g., Intel switched from a ring to a grid
    when they increased the cores in their server chips).


    That particular problem is addressed by grouping smaller cores into
    clusters with a shared L2 cache. It's especially effective for scaling
    when the L2 cache is truly inclusive relative to the underlying L1 caches.
    The price is limited L2 bandwidth as seen by the cores.

    BTW, I didn't find any info about the replacement policy of Intel's
    Sierra Forest L2 caches.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@[email protected] (Anton Ertl) to comp.arch on Fri Sep 19 15:05:56 2025
    From Newsgroup: comp.arch

    EricP <[email protected]> writes:
    Anton Ertl wrote:
    Thomas Koenig <[email protected]> writes:
    BGB <[email protected]> schrieb:

    Still sometimes it seems like it is only a matter of time until Intel or AMD releases a new CPU that just sort of jettisons x86 entirely at the hardware level, but then pretends to still be an x86 chip by running
    *everything* in a firmware level emulator via dynamic translation.
    For AMD, that has happened already a few decades ago; they translate
    x86 code into RISC-like microops.

    That's nonsense; regulars of this group should know better, at least
    this nonsense has been corrected often enough. E.g., I wrote in
    <[email protected]>:

    |Not even if the microcode the Intel and AMD chips used was really
    |RISC-like, which it was not (IIRC the P6 uses micro-instructions with
    |around 100bits, and the K7 has a read-write Rop (with the "R" of "Rop"
    |standing for "RISC").

    I don't know what you are objecting to

    I am objecting to the claim that uops are RISC-like, and that there is
    a translation to RISC occuring inside the CPU, and (not present here,
    but often also claimed) that therefore there is no longer a difference
    between RISC and non-RISC.

    One can discuss the details, but at the end of the day, uops are some implementation-specific internals of the microarchitecture, whereas a
    RISC architecture is an architecture.

    The number of bits has nothing to do with what it is called.
    If this uOp was for a ROB style design where all the knowledge about
    each instruction including register ids, immediate data,
    scheduling info, result data, status, is stored in a single ROB entry,
    then 100 bits sounds pretty small, so I'm guessing that was a 32-bit CPU.

    Yes, P6 is the code name for the Pentium Pro, which has a ROB, and,
    more importantly, valued reservation stations, and yes, the 118 or
    whatever bits include the operands. I have no idea how the P6 handles
    its 80-bit FP with valued RSs; maybe it has bigger uops in its FP part
    (but I think it has a unified scheduler, so that would not work out,
    or maybe I'm missing something).

    But concerning the discussion at hand: Containing the data is a
    significant deviation from RISC instruction sets, and RISC
    instructions are typically only 32 bits or 16 bits wide.

    Another difference is that the OoO engine that sees the uOps performs
    only a very small part of the functionality of branches, with the
    majority performed by the front end. I.e., there is no branching in
    the OoO engine that sees the uOps; at most it confirms the branch
    prediction, or diagnoses a misprediction, at which point the OoO
    engine is out of a job and has to wait for the front end; possibly
    only the ROB (which deals with instructions again) resolves the
    misprediction and kicks the front end into action, however.

    And a uOp triggers that action sequence.
    I don't see the distinction you are trying to make.

    The major point is that the OoO engine (the part that deals with uops)
    sees a linear sequence of uops it has to process, with nearly all
    actual branch processing (which an architecture has to do) done in a
    part that does not deal with uops. With the advent of uop caches that
    has changed a bit, but many of the CPUs for which the uop=RISC claim
    has been made do not have a uop cache.

    It's not entirely clear which parts of the
    engine see MacroOps and ROPs, but my impression was that the MacroOps
    are not split into ROPs for the largest part of the OoO engine.

    AMD explains their terminology here, but note that the relationship
    between Macro-Ops and Micro-Ops is microarchitecture-specific.

    A Seventh-Generation x86 Microprocessor, 1999
    https://www.academia.edu/download/70925991/4.79985120211001-19357-4pufup.pdf

    "An [micro-]OP is the minimum executable entity understood by the machine."
    A macro-op is a bundle of 1 to 3 micro-ops.
    Simple instructions map to 1 macro and 1-3 micro ops
    and this mapping is done in the decoder.
    Complex instructions map to one or more "micro-lines" each of which
    consists of 3 macro-ops (of 1-3 micro-ops each) pulled from micro-code ROM.

    Yes, so much is clear. It's not clear where Macro-Ops are in play and
    where Micro-Ops are in play. Over time I get the impression that the
    macro-ops are the main thing running through the OoO engine, and
    Micro-Ops are only used in specific places, but it's completely
    unclear to me where. E.g., if they let an RMW Macro-Op run through
    the OoO engine, it would first go to the LSU for the address
    generation, translation and load, then to the ALU for the
    modification, then to the LSU for the store, and then to the ROB.
    Where in this whole process is a Micro-Op actually stored?
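
    For concreteness, the bundling relationship the paper describes might be
    sketched like this (terminology only; nothing here is AMD's actual format,
    and the RMW mapping is a hypothetical example):

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class MicroOp:            # "minimum executable entity": a load, ALU op, store, ...
        kind: str

    @dataclass
    class MacroOp:            # a bundle of 1 to 3 micro-ops, tracked as one unit
        micro_ops: List[MicroOp]

    # Simple instruction: the decoder emits one macro-op directly.
    # Hypothetical: ADD [mem], reg -> load + alu + store in a single macro-op.
    add_mem = MacroOp([MicroOp("load"), MicroOp("alu"), MicroOp("store")])

    # Microcoded complex instruction: micro-lines of 3 macro-ops come from ROM.
    micro_line = [MacroOp([MicroOp("load")]),
                  MacroOp([MicroOp("alu")]),
                  MacroOp([MicroOp("store")])]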

    This is a bit introductory level:

    Book
    Computer Organization and Design
    The Hardware/Software Interface: RISC-V Edition, 2018
    Patterson, Hennessy

    Their "Computer Architecture" book is also revised every few years,
    but their treatment of OoO makes me think that they are not at all
    interested in that part anymore, instead more in, e.g., multiprocessor
    memory subsystems.

    And the fact that we see so few recent books on the topics makes me
    think that many in academia have decided that this is a topic that
    they leave to industry.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Fri Sep 19 16:14:53 2025
    From Newsgroup: comp.arch


    [email protected] (Anton Ertl) posted:

    EricP <[email protected]> writes:
    Anton Ertl wrote:
    Thomas Koenig <[email protected]> writes:
    BGB <[email protected]> schrieb:
    -------------------------------

    Yes, so much is clear. It's not clear where Macro-Ops are in play and
    where Micro-Ops are in play. Over time I get the impression that the macro-ops are the main thing running through the OoO engine, and
    Micro-Ops are only used in specific places, but it's completely
    unclear to me where. E.g., if they let an RMW Macro-Op run through
    the OoO engine, it would first go to the LSU for the address
    generation, translation and load, then to the ALU for the
    modification, then to the LSU for the store, and then to the ROB.
    Where in this whole process is a Micro-Op actually stored?

    In the reservation station.


    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Fri Sep 19 16:23:06 2025
    From Newsgroup: comp.arch


    [email protected] (Anton Ertl) posted:

    BGB <[email protected]> writes:
    On 9/17/2025 4:33 PM, John Levine wrote:
    According to BGB <[email protected]>:
    --------------------------------------

    I have thought about why the idea of more smaller cores has not been
    more successful, at least for the kinds of loads where you have a
    large number of independent and individually not particularly
    demanding threads, as in web shops. My explanation is that you need
    1) memory bandwidth and 2) interconnection with the rest of the
    system.

    Yes, exactly:: if you have a large number of cores doing a performance of
    X, they will need exactly the same memory BW as a smaller number of cores
    also performing at X.

    In addition, the interconnect has to be at least as good as the small core system.

    The interconnection with the rest of the system probably does
    not get much cheaper for the smaller cores, and probably becomes more expensive with more cores (e.g., Intel switched from a ring to a grid
    when they increased the cores in their server chips).

    The bandwidth requirements to main memory for given cache sizes per
    core reduce linearly with the performance of the cores; if the larger
    number of smaller cores really leads to increased aggregate
    performance, additional main memory bandwidth is needed, or you can compensate for that with larger caches.

    Sooner or later, you actually have to read/write main memory.

    But to eliminate some variables, let's just consider the case where we
    want to get the same performance with the same main memory bandwidth
    from using more smaller cores than we use now. Will the resulting CPU require less area? The cache sizes per core are not reduced, and
    their area is not reduced much.

    A core running at ½ the performance can use a cache that is ¼ the size
    and see the same percentage degradation WRT cache misses (as long as
    main memory is equally latent). TLBs too.
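
    That is consistent with the usual square-root rule of thumb, where miss
    rate scales roughly as 1/sqrt(cache size); a quick check with invented
    baseline numbers:

    # Big core: 1 ns per instruction, 2% miss rate, 80 ns to main memory.
    stall_per_work = 0.02 * 80.0 / 1.0       # memory stall time per unit of compute time
    # Half-performance core: 2 ns per instruction, same memory latency.
    # A quarter-size cache doubles the miss rate (1/sqrt(1/4) = 2) to 4%.
    stall_per_work_half = 0.04 * 80.0 / 2.0
    print(stall_per_work, stall_per_work_half)   # 1.6 and 1.6: same relative degradation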

    The core itself will get smaller, and

    12× smaller and 12× lower power

    its performance will also get smaller (although by less than the
    core).

    for ½ the performance

    But if you sum up the total per-core area (core, caches, and interconnect), at some point the per-core area reduces by less than
    the per-core performance, so for a given amount of total performance,
    the area goes up.

    GBOoO Cores tend to be about the size of 512KB of L2

    There is one counterargument to these considerations: The largest configuration of Turin dense has less cache for more cores than the
    largest configuration of Turin. I expect that's the reason why they
    offer both; if you have less memory-intensive loads, Turin dense with
    the additional cores will give you more performance, otherwise you
    better buy Turin.

    Also, Intel has added 16 E-Cores to their desktop chips without giving
    them the same amount of caches as the P-Cores; e.g., in Arrow lake we
    have

    P-core 48KB D-L0 64KB I-L1 192KB D-L1 3MB L2 3MB L3/core
    E-Core 32KB D-L1 64KB I-L1 4MB L2/4 cores 3MB L3/4cores

    Here we don't have an alternative with more P-Cores and the same
    bandwidth, so we cannot contrast the approaches. But it's certainly
    the case that if you have a bandwidth-hungry load, you don't need to
    buy the Arrow Lake with the largest number of E-Cores.

    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@[email protected] to comp.arch on Fri Sep 19 11:41:19 2025
    From Newsgroup: comp.arch

    On 9/19/2025 9:33 AM, Anton Ertl wrote:
    BGB <[email protected]> writes:
    Like, most of the ARM chips don't exactly have like a 150W TDP or similar...

    And most Intel and AMD chips don't have a 150W TDP either, although the
    shenanigans they play with TDP are not nice. The usual TDP for
    Desktop chips is 65W (with the power limits temporarily or permanently higher). The Zen5 laptop chips (Strix Point, Krackan Point) have a configurable TDP of 15-54W. Lunar Lake (4 P-cores, 4 LP-E-cores) has
    a configurable TDP of 8-37W.



    Seems so...
    Seems the CPU I am running has a 105W TDP; I had thought I remembered
    150W, oh well...

    Seems 150-200W is more Threadripper territory, and not the generic
    desktop CPUs.


    Like, if an ARM chip uses 1/30th the power, unless it is more than 30x
    slower, it may still win in Perf/W and similar...

    No TDP numbers are given for Oryon. For Apple's M4, the numbers are

    M4 4P 6E 22W
    M4 Pro 8P 4E 38W
    M4 Pro 10P 4E 46W
    M4 Max 10P 4E 62W
    M4 Max 12P 4E 70W

    Not quite 1/30th of the power, although I think that Apple does not
    play the same shenanigans as Intel and AMD.



    A lot of the ARM SoCs I had seen had lower TDPs, though more often with Cortex A53 or A55/A78 cores or similar:

    Say (MediaTek MT6752):
    https://unite4buy.com/cpu/MediaTek-MT6752/
    Has a claimed TDP here of 7W and has 8x A53.

    Or, for a slightly newer chip (2020):
    https://www.cpu-monkey.com/en/cpu-mediatek_mt8188j

    TDP 5W, has A55 and A78 cores.


    Some amount of the HiSilicon numbers look similar...


    But, yeah, I guess if using these as data-points:
    A55: ~ 5/8W, or ~ 0.625W (very crude)
    Zen+: ~ 105/16W, ~ 6.56W

    So, more like 10x here, but ...


    Then, I guess it becomes a question of the relative performance
    difference, say, between a 2.0 GHz A55 vs a 3.7 GHz Zen+ core...
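
    Putting those crude numbers together, the Perf/W break-even point is just
    the per-core power ratio (same invented precision as above):

    a55_w = 5.0 / 8       # ~0.625 W per core, from the MT6752 guess above
    zen_w = 105.0 / 16    # ~6.56 W per core, SMT and uncore ignored
    power_ratio = zen_w / a55_w               # ~10.5x
    # If the Zen+ core is less than ~10.5x faster per thread, the A55 wins on
    # Perf/W; if it is more than ~10.5x faster, the Zen+ wins.
    print(power_ratio)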

    Judging based on my cellphone (with A53 cores), and previously running
    my emulator in Termux, there is a performance difference, but nowhere
    near 10x.


    Probably need to set up a RasPi with a 64-bit OS at some point and see
    how this performs... (wouldn't really be as accurate to compare x86-64
    with 32-bit ARM).


    [RISC-V]
    recent proposals for indexed load/store and auto-increment popping up,

    Where can I read about that?


    For now, just on the mailing lists, e.g.: https://lists.riscv.org/g/tech-arch-review/message/368


    - anton

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@[email protected] to comp.arch on Fri Sep 19 12:00:07 2025
    From Newsgroup: comp.arch

    On 9/19/2025 4:50 AM, Anton Ertl wrote:
    BGB <[email protected]> writes:
    On 9/17/2025 4:33 PM, John Levine wrote:
    According to BGB <[email protected]>:
    Still sometimes it seems like it is only a matter of time until Intel or AMD releases a new CPU that just sort of jettisons x86 entirely at the hardware level, but then pretends to still be an x86 chip by running
    *everything* in a firmware level emulator via dynamic translation.

    That sounds a whole lot like what Transmeta did 25 years ago:

    https://en.wikipedia.org/wiki/Transmeta_Crusoe

    They failed but perhaps things are different now. Their
    native architecture was VLIW which might have been part
    of the problem.


    Might be different now:
    25 years ago, Moore's law was still going strong, and the general
    concern was more about maximizing scalar performance rather than energy
    efficiency or core count (and, in those days, processors were generally
    single-core).

    IA-64 CPUs were shipped until July 29, 2021, and Poulson (released
    2012) has 8 cores. If IA-64 (and dynamically translating AMD64 to it)
    would be a good idea nowadays, it would not have been canceled.

    How should the number of cores change anything? If you cannot make single-threaded IA-32 or AMD64 programs run at competitive speeds on
    IA-64 hardware, how would that inefficiency be eliminated in
    multi-threaded programs?

    Now we have a different situation:
    Moore's law is dying off;

    Even if that is the case, how should that change anything about the
    relative merits of the two approaches?

    Scalar CPU performance has hit a plateau;

    True, but again, what's the relevance for the discussion at hand?

    And, for many uses, performance is "good enough";

    In that case, better buy a cheaper AMD64 CPU rather than a
    particularly fast CPU with a different architecture X and then run a
    dynamic AMD64->X translator on it.


    Possibly, it depends.

    The question is what could Intel or AMD do if the wind blew in that
    direction.

    For the end-user, the experience is likely to look similar, so they
    might not need to know/care if they are using some lower-power native
    chip, or something that is internally running on a dynamic translator to
    some likely highly specialized ISA.



    A lot more software can make use of multi-threading;

    Possible, but how would it change things?


    Multi-threaded software does not tend to depend as much on single-thread performance as single threaded software...


    Likewise, x86 tends to need a lot of the "big CPU" stuff to perform
    well, whereas something like a RISC style ISA can get better performance
    on a comparably smaller and cheaper core, and with a somewhat better
    "performance per watt" metric.

    Evidence?


    No hard numbers, but experience here:
    ASUS Eee (with an in-order Intel Atom) vs original RasPi (with 700MHz
    ARM11 cores).

    The RasPi basically runs circles around the Eee...


    Though, no good datapoints for fast x86 emulators here.
    At least DOSBox and QEMU running x86 on RasPi tend to be dead slow.



    ( no time right now, so skipping rest )

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@[email protected] to comp.arch on Fri Sep 19 12:38:51 2025
    From Newsgroup: comp.arch

    On 9/19/2025 12:00 PM, BGB wrote:
    On 9/19/2025 4:50 AM, Anton Ertl wrote:
    BGB <[email protected]> writes:
    On 9/17/2025 4:33 PM, John Levine wrote:
    According to BGB  <[email protected]>:
    Still sometimes it seems like it is only a matter of time until
    Intel or
    AMD releases a new CPU that just sort of jettisons x86 entirely at the hardware level, but then pretends to still be an x86 chip by running *everything* in a firmware level emulator via dynamic translation.

    That sounds a whole lot like what Transmeta did 25 years ago:

    https://en.wikipedia.org/wiki/Transmeta_Crusoe

    They failed but perhaps things are different now.  Their
    native architecture was VLIW which might have been part
    of the problem.


    Might be different now:
    25 years ago, Moore's law was still going strong, and the general
    concern was more about maximizing scalar performance rather than energy
    efficiency or core count (and, in those days, processors were generally
    single-core).

    IA-64 CPUs were shipped until July 29, 2021, and Poulson (released
    2012) has 8 cores.  If IA-64 (and dynamically translating AMD64 to it)
    would be a good idea nowadays, it would not have been canceled.

    How should the number of cores change anything?  If you cannot make
    single-threaded IA-32 or AMD64 programs run at competitive speeds on
    IA-64 hardware, how would that inefficiency be eliminated in
    multi-threaded programs?

    Now we have a different situation:
       Moore's law is dying off;

    Even if that is the case, how should that change anything about the
    relative merits of the two approaches?

       Scalar CPU performance has hit a plateau;

    True, but again, what's the relevance for the discussion at hand?

       And, for many uses, performance is "good enough";

    In that case, better buy a cheaper AMD64 CPU rather than a
    particularly fast CPU with a different architecture X and then run a
    dynamic AMD64->X translator on it.


    Possibly, it depends.

    The question is what could Intel or AMD do if the wind blew in that direction.

    For the end-user, the experience is likely to look similar, so they
    might not need to know/care if they are using some lower-power native
    chip, or something that is internally running on a dynamic translator to some likely highly specialized ISA.



       A lot more software can make use of multi-threading;

    Possible, but how would it change things?


    Multi-threaded software does not tend to depend as much on single-thread performance as single threaded software...


    Likewise, x86 tends to need a lot of the "big CPU" stuff to perform
    well, whereas something like a RISC style ISA can get better performance on a comparably smaller and cheaper core, and with a somewhat better
    "performance per watt" metric.

    Evidence?


    No hard numbers, but experience here:
    ASUS Eee (with an in-order Intel Atom) vs original RasPi (with 700MHz
    ARM11 cores).

    The RasPi basically runs circles around the Eee...


    Though, no good datapoints for fast x86 emulators here.
      At least DOSBox and QEMU running x86 on RasPi tend to be dead slow.



    ( no time right now, so skipping rest )


    Seems I have a little time still...

    Did find this: https://browser.geekbench.com/v4/cpu/compare/2498562?baseline=2792960

    Not an exact match; I think the Eee was running the Atom at a somewhat
    lower clock speed, and this compares a Pi3 rather than the original Pi.
    The Pi3 has 4x A53 cores.


    But, yeah, they are roughly matched on single-thread performance even when
    the Atom has a clock-speed advantage.

    Though, this seems to imply that they are more just "comparable" on the performance front, rather than Atom being significantly slower...


    Would need to try to dig-out the Eee and re-test, assuming it still
    works/etc.


    --- Synchronet 3.21a-Linux NewsLink 1.2