• Article on the 8088 bus cycle

    From Thomas Koenig@[email protected] to comp.arch on Mon May 13 17:02:57 2024
    From Newsgroup: comp.arch

    For anybody who ever wondered what exactly the 8088 was doing
    while it was wast^H^H^H^Husing four cycles per memory access,
    here's an interesting article:

    http://www.righto.com/2024/04/intel-8088-bus-state-machine.html
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@[email protected] (MitchAlsup1) to comp.arch on Mon May 13 17:21:36 2024
    From Newsgroup: comp.arch

    Thomas Koenig wrote:

    For anybody who ever wondered what exactly the 8088 was doing
    while it was wast^H^H^H^Husing four cycles per memory access,
    here's an interesting article:

    http://www.righto.com/2024/04/intel-8088-bus-state-machine.html

    That was fun, thanks.
  • From Terje Mathisen@[email protected] to comp.arch on Wed May 15 13:37:04 2024
    From Newsgroup: comp.arch

    Thomas Koenig wrote:
    For anybody who ever wondered what exactly the 8088 was doing
    while it was wast^H^H^H^Husing four cycles per memory access,
    here's an interesting article:

    http://www.righto.com/2024/04/intel-8088-bus-state-machine.html

    I always thought the 4 cycles were inherited from the 8080/Z80 CPUs and
    their support chips, which the PC was going to use.

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
  • From BGB@[email protected] to comp.arch on Wed May 15 15:30:16 2024
    From Newsgroup: comp.arch

    On 5/15/2024 6:37 AM, Terje Mathisen wrote:
    Thomas Koenig wrote:
    For anybody who ever wondered what exactly the 8088 was doing
    while it was wast^H^H^H^Husing four cycles per memory access,
    here's an interesting article:

    http://www.righto.com/2024/04/intel-8088-bus-state-machine.html

    I always thought the 4 cycles was inherited from the 8080/Z80 cpus and
    their support chips which the PC was going to use.


    Though, does seem like that era of RAM is easier to deal with than more
    modern RAM (such as DDR2/DDR3), where one needs to deal with
    opening/closing rows and using burst transfers to send/receive data, ...


    Though, 4 cycles per byte would not be fast, particularly at low
    clock-speeds.

    Still kinda interesting in a way.




    Seems like probably in a similar area as QSPI RAM.

    Quick skim, looks like QSPI RAM access looks something like:
    Pull CS low;
    Send command byte;
    Send address bytes (4);
    Send/receive data bytes;
    CS goes high when transfer is done;
    CS going high apparently puts the chip back in its idle state.


    If you do a 16-byte burst, this would be ~ 1.4 cycles (DDR) per data
    byte, or 2.8 cycles if driving it from a faster SDR clock. A datasheet
    for a random QSPI RAM chip I found suggests it has a maximum operating
    frequency of around 54 MHz (so, a little lower than the DDR chips), and
    a lot are apparently "pseudo static" (they are DRAM internally, but also
    perform their own RAM refresh, appearing as SRAM from the POV of the
    external bus interface).
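
    As a rough check of the burst figures above, here is a minimal sketch
    that counts bus clocks per payload byte. It assumes the command and
    address bytes also go out over all 4 data lines, and the `extra_bytes`
    parameter (for wait/dummy time) is a made-up knob, not something taken
    from a datasheet.

```python
def clocks_per_data_byte(data_bytes, ddr, cmd_bytes=1, addr_bytes=4, extra_bytes=0):
    """Bus clocks spent per payload byte for one quad-I/O burst.

    Assumes command, address, and data all use the 4 data lines.
    extra_bytes is a hypothetical allowance for wait/dummy time.
    """
    total_bytes = cmd_bytes + addr_bytes + extra_bytes + data_bytes
    # With 4 data lines, SDR moves 4 bits per clock (2 clocks per byte)
    # and DDR moves 8 bits per clock (1 clock per byte).
    clocks_per_byte = 1 if ddr else 2
    return total_bytes * clocks_per_byte / data_bytes


if __name__ == "__main__":
    for ddr in (True, False):
        label = "DDR" if ddr else "SDR"
        print(label, clocks_per_data_byte(16, ddr))
```

    With no extra overhead, a 16-byte burst comes out at about 1.3 (DDR)
    and 2.6 (SDR) clocks per data byte; allowing one more byte-time of
    overhead lands closer to the ~1.4 / ~2.8 quoted above.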


    Apparently also if the 4 IO lines are pulled high, and it is then given
    8 clock cycles with CS pulled low, the chips will go into SPI mode and
    expect plain SPI signaling from then on (seems to be similar to how it
    works with SDcards, *). Similarly, QSPI Flash seems like an intermediate
    between the SRAM interface and an SDcard.


    *: But makes me wonder if the reverse is true:
    Is the 4-bit DDR signaling for SDcards basically the same protocol that
    one uses over SPI, just sent 4 bits at a time on the rising/falling
    clock edges?... (Effectively following a similar pattern to the QSPI RAM
    and Flash modules).


    In my case, I hadn't found much information about accessing SDcards in
    anything other than SPI mode, but I guess if it turns out it is
    basically just QSPI and the same byte-oriented protocol as before, that
    would be useful to know.

    Though, not of much immediate relevance:
    Would still need to design a more efficient MMIO interface or a DMA
    mechanism or similar before I could gain much additional bandwidth from
    such a thing.

    The original 8-bit MMIO interface hit a wall at around 600 KB/s, and
    the current 64-bit MMIO interface has a hard limit of ~ 4 MB/s; vs the ~
    20-50 MB/s which could theoretically be usable in a QSPI-like mode,
    provided there is a way to access it. Reading/writing via an MMIO buffer
    would still be a bottleneck; would need some way to initiate a
    block-transfer to/from the L2 cache.

    I guess I could theoretically hack it onto my existing MMIO interface:
    beyond just the EMIT and EMIX8X flags, there could be EMITDMA_R /
    EMITDMA_W flags or similar, where the 64-bit SPI_QDATA register is
    instead used to encode a memory address and a block size, and the SPI
    module will read/write the requested block (then maybe add other flags
    for SPI vs QSPI/SDR and QSPI/DDR operation, ...).
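
    To make the "address plus block size in one 64-bit register" idea
    concrete, here is a small sketch. The field split (48-bit address in
    the low bits, 16-bit byte count in the high bits) is purely an
    assumption for illustration; nothing above fixes an actual layout.

```python
ADDR_BITS = 48  # assumed width of the memory-address field
SIZE_BITS = 16  # assumed width of the block-size field (in bytes)


def pack_dma_request(addr, size):
    """Pack an address and block size into one 64-bit word (hypothetical layout)."""
    if not 0 <= addr < (1 << ADDR_BITS):
        raise ValueError("address does not fit in the assumed field")
    if not 0 < size < (1 << SIZE_BITS):
        raise ValueError("size does not fit in the assumed field")
    return (size << ADDR_BITS) | addr


def unpack_dma_request(word):
    """Inverse of pack_dma_request: returns (addr, size)."""
    addr = word & ((1 << ADDR_BITS) - 1)
    size = word >> ADDR_BITS
    return addr, size
```

    Software would write the packed word to the data register and then set
    the read or write DMA flag, instead of copying the block through MMIO.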


  • From mitchalsup@[email protected] (MitchAlsup1) to comp.arch on Wed May 15 21:41:32 2024
    From Newsgroup: comp.arch

    BGB wrote:

    Seems like probably in a similar area as QSPI RAM.

    Quick skim, looks like QSPI RAM access looks something like:
    Pull CS low;
    Send command byte;
    Send address bytes (4);
    Send/receive data bytes;
    CS goes high when transfer is done;
    CS going high apparently puts the chip back in its idle state.


    If you do a 16-byte burst, this would be ~ 1.4 cycles (DDR) per data
    byte, or 2.8 cycles if driving it from a faster SDR clock. A datasheet
    for a random QSPI RAM chip I found suggests it has a maximum operating
    frequency of around 54 MHz (so, a little lower than the DDR chips), and
    a lot are apparently "pseudo static" (they are DRAM internally, but also
    perform their own RAM refresh, appearing as SRAM from the POV of the
    external bus interface).

    You are forgetting that DRAM RAS occurs after the first 2 address bytes
    are latched, and that CAS occurs after the second 2 address bytes are
    latched {and that you are in a deRAS deCAS state already.}

    QSPI at 54 MHz is just under 20 ns per DRAM address/command event, not
    that much different than current DDRs.

    But what current DDRs can do is to partially overlap address/command with
    data transfer--I would suspect QSPI could do this too.
  • From Michael S@[email protected] to comp.arch on Thu May 16 01:23:20 2024
    From Newsgroup: comp.arch

    On Wed, 15 May 2024 21:41:32 +0000
    [email protected] (MitchAlsup1) wrote:

    BGB wrote:

    Seems like probably in a similar area as QSPI RAM.

    Quick skim, looks like QSPI RAM access looks something like:
    Pull CS low;
    Send command byte;
    Send address bytes (4);
    Send/receive data bytes;
    CS goes high when transfer is done;
    CS going high apparently puts the chip back in its idle state.



    If you do a 16-byte burst, this would be ~ 1.4 cycles (DDR) per
    data byte, or 2.8 cycles if driving it from a faster SDR clock. A
    datasheet for a random QSPI RAM chip I found suggests it has a
    maximum operating frequency of around 54 MHz (so, a little lower
    than the DDR chips), and a lot are apparently "pseudo static" (they
    are DRAM internally, but also perform their own RAM refresh,
    appearing as SRAM from the POV of the external bus interface).

    You are forgetting that DRAM RAS occurs after the first 2 address
    bytes are latched, and that CAS occurs after the second 2 address
    bytes are latched {and that you are in a deRAS deCAS state already.}


    We are talking about 1978 here. Back then, it was a 7-bit row address
    and a 7-bit column address. I don't know how they applied address bits
    above A13. Later on there were chip select and/or output enable signals,
    but circa-1978 16-Kbit DRAM had neither of those. So, it seems, if one
    wanted to put more than 16 KB on an 8-bit bus, one had to generate
    multiple sets of RAS# and CAS# signals.
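
    A sketch of how such an address might be split, assuming a 16-bit CPU
    address, 16K x 1 chips (one per data bit), and bits above A13 picking
    which RAS#/CAS# set to strobe. Which 7 bits go out as the row and
    which as the column is an assumption here; real boards varied.

```python
def split_dram_address(addr):
    """Split an address into (bank, row, column) for 16K x 1 DRAMs.

    Assumed mapping: A0..A6 -> column, A7..A13 -> row, A14 and up -> bank,
    where each bank has its own gated RAS#/CAS# pair.
    """
    column = addr & 0x7F
    row = (addr >> 7) & 0x7F
    bank = addr >> 14
    return bank, row, column


if __name__ == "__main__":
    for a in (0x0000, 0x3FFF, 0x4000, 0xFFFF):
        print(hex(a), split_dram_address(a))
```

    Under this mapping a 64 KB address space needs four banks, so four
    separately gated strobe sets.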

    Since then the number of rows, first per DRAM chip and later per bank,
    grew tremendously, but the number of columns only grew by a factor of 8
    and has remained the same for more than 20 years.

    QSPI at 54 MHz is just under 20 ns per DRAM address/command event, not
    that much different than current DDRs

    But what current DDRs can do is to partially overlap address/command
    with data transfer--I would suspect QSPI could do this too.

    Partially?
    With a typical CL=14==28T, and with a burst that just recently was
    increased to 16T and before that stayed at 8T for more than a decade,
    I'd say that it's full overlap ++.





  • From David Schultz@[email protected] to comp.arch on Wed May 15 17:35:58 2024
    From Newsgroup: comp.arch

    On 5/15/24 3:30 PM, BGB wrote:
    In my case, I hadn't found much information about accessing SDcards in
    anything other than SPI mode, but I guess if it turns out it is
    basically just QSPI and the same byte-oriented protocol as before, that
    would be useful to know.

    The SD card specification covers both in detail.

    But the 4 bit wide SD access will depend a lot on what hardware support
    you have. I used an ARM (in a Teensy 3.2) to do the job. Mostly quite
    similar to SPI access.
    --
    http://davesrocketworks.com
    David Schultz

  • From BGB@[email protected] to comp.arch on Wed May 15 18:48:44 2024
    From Newsgroup: comp.arch

    On 5/15/2024 5:35 PM, David Schultz wrote:
    On 5/15/24 3:30 PM, BGB wrote:
    In my case, I hadn't found much information about accessing SDcards in
    anything other than SPI mode, but I guess if it turns out it is
    basically just QSPI and the same byte-oriented protocol as before,
    that would be useful to know.

    The SD card specification covers both in detail.


    When looking around before, I generally only found people talking about
    the SPI interfaces. Granted, I didn't have the official specifications
    (which seemed to be paywalled and not generally available otherwise). I
    mostly implemented stuff based on information I found on various websites.


    Though, looking around some more, apparently the 4-bit interface is
    basically the same protocol (at the byte level) as the 1-bit SPI
    protocol, but differs mostly in that it is 4-bit SDR (apparently the DDR
    variants are specific to UHS-I / UHS-II, and the "Full Speed" variant
    is SDR).

    Apparently, it is also functionally equivalent to QSPI in most other
    regards (well, apart from QSPI RAM/Flash and SDcards having different
    communication protocols).


    But the 4 bit wide SD access will depend a lot on what hardware support
    you have. I used an ARM (in a Teensy 3.2) to do the job. Mostly quite
    similar to SPI access.


    I am mostly using FPGAs here, so in theory it shouldn't be too much
    effort to add QSPI support.


    Bigger challenge is figuring out how to best modify the design of the
    SPI MMIO interface to be able to make use of the higher data transfer
    speeds.

    Though most immediate solution would probably just be to add some more
    MMIO registers to increase the data-transfer size; though this quickly
    turns into diminishing returns.

    Another possibility being to widen the MMIO Bus interface to 128 bits.

    Mostly, the limiting factor is that it takes roughly 24 clock cycles for
    every access to the MMIO bus in my case.


    Though, looks like increasing the transfer size to 32 bytes (via adding
    more MMIO registers and a few more control bits) would raise the
    bottleneck to around 14 MB/s, which looks like mostly enough to make
    effective use of SDcard 4-bit / Full Speed mode...
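
    A back-of-envelope model of that bottleneck, as a sketch only. The 24
    cycles per MMIO access is from above; the 50 MHz clock and the number
    of MMIO accesses needed per transfer are assumptions picked for
    illustration, and time spent on the SPI wire itself is ignored.

```python
def mmio_throughput(bytes_per_transfer, accesses_per_transfer,
                    clock_hz=50_000_000, cycles_per_access=24):
    """Upper bound on bytes/second when every MMIO access costs a fixed latency.

    clock_hz and accesses_per_transfer are illustrative assumptions.
    """
    cycles = accesses_per_transfer * cycles_per_access
    return bytes_per_transfer * clock_hz / cycles


if __name__ == "__main__":
    # 8-byte transfers, guessing 4 accesses each (data, control, poll, data).
    print(mmio_throughput(8, 4) / 1e6, "MB/s")
    # 32-byte transfers, guessing 5 accesses each (4 data registers + control).
    print(mmio_throughput(32, 5) / 1e6, "MB/s")
```

    Under those guesses the model gives a bit over 4 MB/s for 8-byte
    transfers and a bit over 13 MB/s for 32-byte transfers, which is in the
    same range as the figures mentioned here.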




  • From David Schultz@[email protected] to comp.arch on Wed May 15 19:49:34 2024
    From Newsgroup: comp.arch

    On 5/15/24 6:48 PM, BGB wrote:


    When looking around before, I generally only found people talking about
    the SPI interfaces. Granted, I didn't have the official specifications
    (which seemed to be paywalled and not generally available otherwise). I
    mostly implemented stuff based on information I found on various websites.


    Parts of it are/were paywalled but the interesting bit hasn't been. At
    least as far back as SD 1.0.

    https://www.sdcard.org/downloads/pls/

    I am not sure which one of those is the one to look at. Probably the one
    with the 9.1 revision level.
    --
    http://davesrocketworks.com
    David Schultz

  • From EricP@[email protected] to comp.arch on Thu May 16 00:21:28 2024
    From Newsgroup: comp.arch

    Michael S wrote:

    We are talking about 1978 here. Back then, it was a 7-bit row address
    and a 7-bit column address. I don't know how they applied address bits
    above A13. Later on there were chip select and/or output enable signals,
    but circa-1978 16-Kbit DRAM had neither of those. So, it seems, if one
    wanted to put more than 16 KB on an 8-bit bus, one had to generate
    multiple sets of RAS# and CAS# signals.

    Yes. I'm looking at a 1979 Motorola Memory book and both the
    16k*1 MCM4116A and 64k*1 MCM6664 were 16 pin packages without
    chip select. You controlled them by gating the RAS and CAS.
    At access time after CAS the output switches from high-z to valid data,
    so at least you can just wire the outputs together.

    The 4116A's were a pain because they required power of +12V, +5V, -5V, GND
    (but so did the 8080), and power supplies often only produced +5V,
    so you had to use voltage pumps for +12V and -5V, and those pumps were
    really inefficient and got really hot, so now you needed a fan too.

    The 6664's only required 5V and GND.


  • From BGB-Alt@[email protected] to comp.arch on Thu May 16 16:21:02 2024
    From Newsgroup: comp.arch

    On 5/15/2024 7:49 PM, David Schultz wrote:
    On 5/15/24 6:48 PM, BGB wrote:


    When looking around before, I generally only found people talking
    about the SPI interfaces. Granted, I didn't have the official
    specifications (which seemed to be paywalled and not generally
    available otherwise). I mostly implemented stuff based on information
    I found on various websites.


    Parts of it are/were paywalled but the interesting bit hasn't been. At
    least as far back as SD 1.0.

    https://www.sdcard.org/downloads/pls/

    I am not sure which one of those is the one to look at. Probably the one
    with the 9.1 revision level.


    Yeah. Had found the 8.0 spec, which seemed to have some useful information.

    Well, and some SDcard features which I don't really get; like apparently
    it is possible to route PCIe over the SDcard interface, guessing
    probably only some cards would support this. Like, would this be so
    someone could stick an SDcard in an M.2 adapter or similar?...


    But, yeah, for modifications to my MMIO interface, I went with:
    Adding QDATA1..QDATA3 registers, which can increase the transfer size to
    32 bytes. Technically, these are using misaligned MMIO addresses, which
    are admittedly a route I have ended up going in a few cases (I had
    originally only given ~ 16B of MMIO space to the device).

    Say:
    FFFF_F000E030: Status/Control
    FFFF_F000E034: DataB (8-bit transfer)
    FFFF_F000E038: DataQ (64-bit transfer)
    FFFF_F000E040: Well, PS/2 Keyboard sits here...
    FFFF_F000E050: PS/2 Mouse...
    ...

    So, ended up redefining things:
    FFFF_F000E038: DataQ / DataQ0 (64 or 256 bit)
    FFFF_F000E039: DataQ1 (256-bit transfer)
    FFFF_F000E03A: DataQ2 (256-bit transfer)
    FFFF_F000E03B: DataQ3 (256-bit transfer)

    The status/control register was used to check the status of SPI
    transfers, and to accept commands for sending data over SPI.


    Using larger transfers is partly because access to the MMIO interface
    has a roughly 24 cycle latency, and using a bigger transfer size reduces
    the amount of accesses that would be spent on checking the BUSY status
    or writing to the Control register for bulk data.


    Added bits to status/control:
    QSPI: Signals to use a QSPI transfer (vs SPI)
    READ: Signals a READ operation (vs SWAP or WRITE)
    DDR: Signals the use of DDR vs SDR
    XMIT32B: Send/Receive 32 bytes.
    With a few other bits having existed:
    XMIT8B: Send/receive 8 bytes.
    XMIT: Send/Receive 1 byte.
    BUSY: SPI is still busy.

    The high-order bits of the Status/Control register are used to encode a
    clock division relative to the base-clock.

    In this case, say:
    XMITxx by itself will try to send data through CMD and DATA0 pins.
    Using DATA3 as the CS pin.
    XMITxx+READ: Will try to read data via the CMD pin;
    XMITxx+QSPI: Will send data via DATA0..DATA3;
    XMITxx+QSPI+READ: Will receive data via DATA0..DATA3;
    ...
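
    The flag combinations above could be modeled roughly like this. The
    flag names are the ones listed; the bit positions are placeholders
    (the actual encoding is not given here), and the helper functions are
    hypothetical.

```python
from enum import IntFlag


class SpiCtrl(IntFlag):
    # Bit positions are placeholders, not the real register layout.
    XMIT = 1 << 0     # send/receive 1 byte
    BUSY = 1 << 1     # SPI is still busy
    XMIT8B = 1 << 2   # send/receive 8 bytes
    QSPI = 1 << 3     # use a QSPI transfer (vs SPI)
    READ = 1 << 4     # READ operation (vs SWAP or WRITE)
    DDR = 1 << 5      # DDR vs SDR
    XMIT32B = 1 << 6  # send/receive 32 bytes


XMIT_ANY = SpiCtrl.XMIT | SpiCtrl.XMIT8B | SpiCtrl.XMIT32B


def transfer_size(ctrl):
    """Number of bytes moved by the requested transfer (0 if none)."""
    if ctrl & SpiCtrl.XMIT32B:
        return 32
    if ctrl & SpiCtrl.XMIT8B:
        return 8
    if ctrl & SpiCtrl.XMIT:
        return 1
    return 0


def pins_used(ctrl):
    """Return (direction, pins) for a transfer, or None if no XMITxx bit is set."""
    if not ctrl & XMIT_ANY:
        return None
    read = bool(ctrl & SpiCtrl.READ)
    if ctrl & SpiCtrl.QSPI:
        return ("receive" if read else "send", "DATA0..DATA3")
    if read:
        return ("receive", "CMD")
    return ("send", "CMD+DATA0")
```

    For example, `pins_used(SpiCtrl.XMIT32B | SpiCtrl.QSPI | SpiCtrl.READ)`
    describes a 32-byte quad read on DATA0..DATA3.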

    Note that the routing of the signals to the SDcard pins is via glue
    logic external to the module itself (its output signals are more or less
    plain QSPI).


    Base XMITxx would be used for SPI mode, and for writing commands.


    Seems I was wrong about something before:
    The combination of CMD0+CS is what selects SPI mode (for normal SD mode,
    the CS signal would not be asserted).
    I was mistaken in thinking that it was the stream of FF bytes and then
    asserting CS near the end (say, the init procedure for the SDcard
    involving sending a large number of FF bytes at a fairly low speed, then
    sending some other commands and boosting the speed to the intended
    operating speed).


  • From David Schultz@[email protected] to comp.arch on Thu May 16 18:05:03 2024
    From Newsgroup: comp.arch

    On 5/16/24 4:21 PM, BGB-Alt wrote:
    Seems I was wrong about something before:
    The combination of CMD0+CS is what selects SPI mode (for normal SD mode,
    the CS signal would not be asserted).
    I was mistaken in thinking that it was the stream of FF bytes and then
    asserting CS near the end (say, the init procedure for the SDcard
    involving sending a large number of FF bytes at a fairly low speed, then
    sending some other commands and boosting the speed to the intended
    operating speed).


    The low speed was a holdover from its previous life as the MMC card
    standard. (I still have a 64MB MMC card.) MMC cards were intended to be
    a multi-drop solution. As such, some of the signal lines were
    open-drain/collector, so the speed was limited by the pullup resistor
    and parasitic capacitance.
    --
    http://davesrocketworks.com
    David Schultz

  • From BGB@[email protected] to comp.arch on Thu May 16 21:10:05 2024
    From Newsgroup: comp.arch

    On 5/16/2024 6:05 PM, David Schultz wrote:
    On 5/16/24 4:21 PM, BGB-Alt wrote:
    Seems I was wrong about something before:
    The combination of CMD0+CS is what selects SPI mode (for normal SD
    mode, the CS signal would not be asserted).
    I was mistaken in thinking that it was the stream of FF bytes and then
    asserting CS near the end (say, the init procedure for the SDcard
    involving sending a large number of FF bytes at a fairly low speed,
    then sending some other commands and boosting the speed to the
    intended operating speed).


    The low speed was a holdover from its previous life as the MMC card
    standard. (I still have a 64MB MMC card.) MMC cards were intended to be
    a multi-drop solution. As such, some of the signal lines were
    open-drain/collector, so the speed was limited by the pullup resistor
    and parasitic capacitance.


    Yeah.

    Set to 400kHz, send ~ 8K or so worth of FF bytes, send CMD0, etc, then
    boost up to 12.5 MHz or similar.
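
    For reference, a sketch of the byte stream this sequence puts on the
    wire in SPI mode: idle FF bytes, then the 6-byte CMD0 frame with its
    CRC7. The function names are made up for illustration, and the clock
    changes and CS handling are left out; it only builds the bytes.

```python
def crc7(data):
    """CRC7 as used on SD command frames (polynomial x^7 + x^3 + 1)."""
    crc = 0
    for byte in data:
        for bit in range(7, -1, -1):
            in_bit = (byte >> bit) & 1
            top = (crc >> 6) & 1
            crc = (crc << 1) & 0x7F
            if in_bit ^ top:
                crc ^= 0x09
    return crc


def sd_command(index, arg):
    """Build a 6-byte SD command frame: start bits + index, 32-bit arg, CRC7 + stop bit."""
    body = bytes([0x40 | index]) + arg.to_bytes(4, "big")
    return body + bytes([(crc7(body) << 1) | 1])


def init_stream(n_idle_bytes):
    """Idle FF bytes followed by CMD0 (GO_IDLE_STATE)."""
    return b"\xff" * n_idle_bytes + sd_command(0, 0)


if __name__ == "__main__":
    print(sd_command(0, 0).hex())
```

    The CMD0 frame comes out as 40 00 00 00 00 95, the familiar reset
    command bytes.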

    Originally, I was using 5 MHz, but did speed things up. I ended up using
    LZ compression on lots of stuff, as 5MHz was fairly slow (and LZ
    compression made it faster).

    Initially, I also didn't go beyond 5MHz because this was basically the
    same as the bottleneck for the original MMIO interface (so I went to a
    higher speed around the same time that I widened transfers on the MMIO
    interface from 8 bits to 64 bits).



    Theoretically, stuff should work at 25MHz, but generally I had not found
    operation much over 12.5 MHz to be reliable; 16.7 MHz tended to be
    unreliable, and 25 MHz didn't work.

    Divider logic also limits things, say:
    50MHz base clock allows for:
    25MHz, 16.7MHz, 12.5MHz, 10MHz, 8.3MHz, ...
    DDR operation will not currently be possible over 12.5 MHz.

    Here, we need at least 4 cycles for the state-machine logic to do its
    thing for DDR to work.
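
    The divider arithmetic, as a small sketch. It assumes plain integer
    division of the 50 MHz base clock starting at /2, which matches the
    list of frequencies above; the function names are made up.

```python
BASE_HZ = 50_000_000
MIN_CYCLES_DDR = 4  # base-clock cycles the state machine needs per SPI clock for DDR


def divided_clocks(base_hz, max_div=6):
    """SPI clock rates reachable with integer dividers 2..max_div."""
    return {div: base_hz / div for div in range(2, max_div + 1)}


def max_ddr_clock(base_hz, min_cycles=MIN_CYCLES_DDR):
    """Fastest SPI clock that still leaves min_cycles base cycles per SPI clock."""
    return base_hz / min_cycles


if __name__ == "__main__":
    for div, hz in divided_clocks(BASE_HZ).items():
        print(f"/{div}: {hz / 1e6:.1f} MHz")
    print(f"DDR limit: {max_ddr_clock(BASE_HZ) / 1e6:.1f} MHz")
```

    A divider of 4 is the smallest that leaves 4 base-clock cycles per SPI
    clock, hence the 12.5 MHz ceiling for DDR.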


    Granted, there was some uncertainty as I also tended to use a microSD to
    full-size SD adapter cable, using the full-size SD cards for testing
    (where potentially the adapter cable could be reducing what clock-speeds
    are usable). These adapters use a small PCB that plugs into the microSD
    slot, a piece of flat-flex ribbon cable extending from this PCB (~ 8
    inches long), and a full-size SDcard interface on the other end, with in
    this case the flat-flex seemingly soldered directly to the PCBs.

    Mostly this was because microSD cards are too small to handle
    effectively (say, if you drop a full size SDcard, it doesn't just
    disappear in the carpet or similar).

    But, the FPGA boards generally only come with microSD slots for whatever
    reason.

    Mostly using 16GB UHS-I cards, as IIRC they were fairly affordable on
    Amazon at the time (and didn't need much bigger for my projects).


    Originally I went for SPI as it was free to use, but the native
    ("Default Speed" / SDR) mode should now be OK as any patents on this
    will have presumably expired (and the 4x faster speeds could be
    worthwhile).




  • From BGB-Alt@[email protected] to comp.arch on Sat May 18 15:38:30 2024
    From Newsgroup: comp.arch

    On 5/15/2024 4:41 PM, MitchAlsup1 wrote:
    BGB wrote:

    Seems like probably in a similar area as QSPI RAM.

    Quick skim, looks like QSPI RAM access looks something like:
       Pull CS low;
       Send command byte;
       Send address bytes (4);
       Send/receive data bytes;
       CS goes high when transfer is done;
         CS going high apparently puts the chip back in its idle state.


    If you do a 16-byte burst, this would be ~ 1.4 cycles (DDR) per data
    byte, or 2.8 cycles if driving it from a faster SDR clock. A datasheet
    for a random QSPI RAM chip I found suggests it has a maximum operating
    frequency of around 54 MHz (so, a little lower than the DDR chips),
    and a lot are apparently "pseudo static" (they are DRAM internally,
    but also perform their own RAM refresh, appearing as SRAM from the POV
    of the external bus interface).

    You are forgetting that DRAM RAS occurs after the first 2 address bytes
    are latched, and that CAS occurs after the second 2 address bytes are
    latched {and that you are in a deRAS deCAS state already.}

    QSPI at 54 MHz is just under 20 ns per DRAM address/command event, not
    that much different than current DDRs

    But what current DDRs can do is to partially overlap address/command
    with data transfer--I would suspect QSPI could do this too.

    I suspect QSPI could not do this, as it does not have separate pins for
    command and data.

    IOW:
    CS, CLK, D0..D3


    It seems to be using linear addresses rather than row/column (so,
    unlike DDRx), with 32-bit addresses even though most of the chips
    could probably get by with 24 (maybe it was "less bad" to waste an
    extra byte on the address than to have separate versions of the
    messaging protocol, or separate 24/32-bit command variants?...).
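
    The trade-off can be seen in a sketch of the frame header. The opcode
    is just a parameter here (no particular chip's command set is
    assumed), and sending the address most-significant byte first is an
    assumption.

```python
def build_header(opcode, addr, addr_bytes=4):
    """Command byte followed by a linear address, MSB first (assumed order)."""
    if not 0 <= addr < (1 << (8 * addr_bytes)):
        raise ValueError("address does not fit in the address field")
    return bytes([opcode]) + addr.to_bytes(addr_bytes, "big")


if __name__ == "__main__":
    # 0xAB is an arbitrary placeholder opcode, not a real command.
    print(build_header(0xAB, 0x123456).hex())
    print(build_header(0xAB, 0x123456, addr_bytes=3).hex())
```

    A fixed 4-byte address costs one extra byte per access on a chip that
    only needs 24 bits, but keeps a single header format for every size.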


    Though, SDcard has separate command and data pins in SD mode; I guess
    this leaves open whether one can send commands in the middle of a block
    read/write. Probably not worth it though, as with 512 byte blocks it
    probably wouldn't make much difference (unless the SDcard had a fairly
    large internal access latency or something).

    ...
