Forum: War Ensemble BBS

Article on the 8088 bus cycle

From Thomas Koenig@[email protected] to comp.arch on Mon May 13 17:02:57 2024

From Newsgroup: comp.arch

For anybody who ever wondered what exactly the 8088 was doing
while it was wast^H^H^H^Husing four cycles per memory access,
here's an interesting article:

http://www.righto.com/2024/04/intel-8088-bus-state-machine.html
--- Synchronet 3.20a-Linux NewsLink 1.114

From mitchalsup@[email protected] (MitchAlsup1) to comp.arch on Mon May 13 17:21:36 2024

From Newsgroup: comp.arch

Thomas Koenig wrote:

For anybody who ever wondered what exactly the 8088 was doing
while it was wast^H^H^H^Husing four cycles per memory access,
here's an interesting article:

http://www.righto.com/2024/04/intel-8088-bus-state-machine.html

That was fun, thanks.
--- Synchronet 3.20a-Linux NewsLink 1.114

From Terje Mathisen@[email protected] to comp.arch on Wed May 15 13:37:04 2024

From Newsgroup: comp.arch

Thomas Koenig wrote:

For anybody who ever wondered what exactly the 8088 was doing
while it was wast^H^H^H^Husing four cycles per memory access,
here's an interesting article:

http://www.righto.com/2024/04/intel-8088-bus-state-machine.html

I always thought the 4 cycles was inherited from the 8080/Z80 cpus and
their support chips which the PC was going to use.

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
--- Synchronet 3.20a-Linux NewsLink 1.114

From BGB@[email protected] to comp.arch on Wed May 15 15:30:16 2024

From Newsgroup: comp.arch

On 5/15/2024 6:37 AM, Terje Mathisen wrote:

Thomas Koenig wrote:

For anybody who ever wondered what exactly the 8088 was doing
while it was wast^H^H^H^Husing four cycles per memory access,
here's an interesting article:

http://www.righto.com/2024/04/intel-8088-bus-state-machine.html

I always thought the 4 cycles was inherited from the 8080/Z80 cpus and
their support chips which the PC was going to use.

Though, does seem like that era of RAM is easier to deal with than more
modern RAM (such as DDR2/DDR3), where one needs to deal with
opening/closing rows and using burst transfers to send/receive data, ...

Though, 4 cycles per byte would not be fast, particularly at low
clock-speeds.

Still kinda interesting in a way.

Seems like probably in a similar area as QSPI RAM.

Quick skim, looks like QSPI RAM access looks something like:
Pull CS low;
Send command byte;
Send address bytes (4);
Send/receive data bytes;
CS goes high when transfer is done;
CS going high apparently puts the chip back in its idle state.

If you do a 16-byte burst, this would be ~ 1.4 cycles (DDR) per data
byte, or 2.8 cycles if driving it from a faster SDR clock. A datasheet
for a random QSPI RAM chip I found suggests it has a maximum operating frequency of around 54 MHz (so, a little lower than the DDR chips), and
a lot are apparently "pseudo static" (they are DRAM internally, but also perform their own RAM refresh, appearing as SRAM from the POV of the
external bus interface).
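
As a rough check on the per-byte figures above, here is a small Python sketch that counts clock cycles for one QSPI burst. It assumes that the command, address and data phases are all sent 4 bits wide. The function name and the `dummy_bytes` parameter are illustrative only and are not taken from any datasheet.

```python
def qspi_cycles_per_data_byte(data_bytes, ddr, cmd_bytes=1, addr_bytes=4,
                              dummy_bytes=0):
    """Clock cycles spent per payload byte in one QSPI burst.

    Assumes every phase uses all 4 data lines. dummy_bytes models any
    turnaround/wait time and is a hypothetical parameter.
    """
    # 4 lines x 2 edges = 8 bits per clock in DDR, 4 bits per clock in SDR.
    bits_per_cycle = 8 if ddr else 4
    total_bytes = cmd_bytes + addr_bytes + dummy_bytes + data_bytes
    total_cycles = total_bytes * 8 / bits_per_cycle
    return total_cycles / data_bytes


print(qspi_cycles_per_data_byte(16, ddr=True))                  # 1.3125
print(qspi_cycles_per_data_byte(16, ddr=False))                 # 2.625
print(qspi_cycles_per_data_byte(16, ddr=True, dummy_bytes=1))   # 1.375
print(qspi_cycles_per_data_byte(16, ddr=False, dummy_bytes=1))  # 2.75
```

With only the command and address as overhead this gives about 1.3 (DDR) and 2.6 (SDR) cycles per byte. Adding one byte of extra overhead (an assumption) gives 1.375 and 2.75, close to the ~ 1.4 and 2.8 estimated above.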

Apparently also if the 4 IO lines are pulled high, and it is then given
8 clock cycles with CS pulled low, the chips will go into SPI mode and
expect plain SPI signaling from then on (seems to be similar to how it
works with SDcards, *). Similarly, QSPI Flash seems like an intermediate between the SRAM interface and an SDcard.

*: But makes me wonder if the reverse is true:
Is the 4-bit DDR signaling for SDcards basically the same protocol that
one uses over SPI, just sent 4 bits at a time on the rising/falling
clock edges?... (Effectively following a similar pattern to the QSPI RAM
and Flash modules).

In my case, I hadn't found much information about accessing SDcards in anything other than SPI mode, but I guess if it turns out it is
basically just QSPI and the same byte-oriented protocol as before, that
would be useful to know.

Though, not of much immediate relevance:
Would still need to design a more efficient MMIO interface or a DMA
mechanism or similar before I could gain much additional bandwidth from
such a thing.

The original 8-bit MMIO interface hit a wall at around 600K/s, and
the current 64-bit MMIO interface has a hard limit of ~ 4MB/s; vs the ~
20-50 MB/s which could theoretically be usable in a QSPI-like mode,
provided there is a way to access it. Copying to/from an MMIO buffer
would still be a bottleneck; I would need some way to initiate a
block-transfer to/from the L2 cache.

I guess I could theoretically hack it onto my existing MMIO interface:
beyond just the EMIT and EMIX8X flags, there could be EMITDMA_R/EMITDMA_W
flags or similar, where the 64-bit SPI_QDATA
register is instead used to encode a memory address and a block size,
and the SPI module will read/write the requested block (then maybe add
another flag for SPI vs QSPI/SDR and QSPI/DDR operation, ...).


--- Synchronet 3.20a-Linux NewsLink 1.114

From mitchalsup@[email protected] (MitchAlsup1) to comp.arch on Wed May 15 21:41:32 2024

From Newsgroup: comp.arch

BGB wrote:

Seems like probably in a similar area as QSPI RAM.

Quick skim, looks like QSPI RAM access looks something like:
Pull CS low;
Send command byte;
Send address bytes (4);
Send/receive data bytes;
CS goes high when transfer is done;
CS going high apparently puts the chip back in its idle state.

If you do a 16-byte burst, this would be ~ 1.4 cycles (DDR) per data
byte, or 2.8 cycles if driving it from a faster SDR clock. A datasheet
for a random QSPI RAM chip I found suggests it has a maximum operating frequency of around 54 MHz (so, a little lower than the DDR chips), and
a lot are apparently "pseudo static" (they are DRAM internally, but also perform their own RAM refresh, appearing as SRAM from the POV of the external bus interface).

You are forgetting that DRAM RAS occurs after the first 2 address bytes
are latched, and that CAS occurs after the second 2 address bytes are
latched {and that you are in a deRAS deCAS state already.}

QSPI at 54 MHz is just under 20ns per DRAM address/command event, not that
much different than current DDRs

But what current DDRs can do is to partially overlap address/command with
data transfer--I would suspect QSPI could do this too.
--- Synchronet 3.20a-Linux NewsLink 1.114

From Michael S@[email protected] to comp.arch on Thu May 16 01:23:20 2024

From Newsgroup: comp.arch

On Wed, 15 May 2024 21:41:32 +0000
[email protected] (MitchAlsup1) wrote:

BGB wrote:

Seems like probably in a similar area as QSPI RAM.

Quick skim, looks like QSPI RAM access looks something like:
Pull CS low;
Send command byte;
Send address bytes (4);
Send/receive data bytes;
CS goes high when transfer is done;
CS going high apparently puts the chip back in its idle state.

If you do a 16-byte burst, this would be ~ 1.4 cycles (DDR) per
data byte, or 2.8 cycles if driving it from a faster SDR clock. A
datasheet for a random QSPI RAM chip I found suggests it has a
maximum operating frequency of around 54 MHz (so, a little lower
than the DDR chips), and a lot are apparently "pseudo static" (they
are DRAM internally, but also perform their own RAM refresh,
appearing as SRAM from the POV of the external bus interface).

You are forgetting that DRAM RAS occurs after the first 2 address
bytes are latched, and that CAS occurs after the second 2 address
bytes are latched {and that you are in a deRAS deCAS state already.}

We are talking about 1978 here. Back then, it was a 7-bit row address
and a 7-bit column address. I don't know how they applied address bits
above A13. Later on there were chip select and/or output enable signals,
but circa-1978 16-Kbit DRAM had neither of those. So, it seems, if one
wanted to put more than 16 KB on an 8-bit bus, one had to generate
multiple sets of RAS# and CAS# signals.
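
To make the multiplexing concrete, here is a minimal Python sketch of how a controller might split a CPU address for banks of 16K x 1 parts. Which address bits feed the row and which feed the column is an assumption here (real boards varied), and the function name is made up for illustration.

```python
ROW_BITS = 7  # 7-bit row address, latched on RAS#
COL_BITS = 7  # 7-bit column address, latched on CAS#


def split_dram_address(addr):
    """Return (bank, row, column) for banks of 16K x 1 DRAM chips.

    Assumption: A0..A6 form the row and A7..A13 the column. Everything
    above A13 selects which bank gets its RAS#/CAS# pair pulsed.
    """
    row = addr & ((1 << ROW_BITS) - 1)
    col = (addr >> ROW_BITS) & ((1 << COL_BITS) - 1)
    bank = addr >> (ROW_BITS + COL_BITS)
    return bank, row, col


print(split_dram_address(0x0000))  # (0, 0, 0)
print(split_dram_address(0x4001))  # (1, 1, 0)
print(split_dram_address(0x7FFF))  # (1, 127, 127)
```

Each distinct bank value corresponds to one separately gated set of strobes, which is the "multiple sets of RAS# and CAS# signals" case.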

Since then the # of rows, first per DRAM chip and later per bank, has
grown tremendously, but the number of columns only grew by a factor of 8
and has remained the same for more than 20 years.

QSPI at 54 MHz is just under 20ns per DRAM address/command event, not
that much different than current DDRs

But what current DDRs can do is to partially overlap address/command
with data transfer--I would suspect QSPI could do this too.

Partially?
With typical CL=14==28T, and with a burst that just recently was
increased to 16T and before that stayed at 8T for more than a decade, I'd
say that it's full overlap ++.

--- Synchronet 3.20a-Linux NewsLink 1.114

From David Schultz@[email protected] to comp.arch on Wed May 15 17:35:58 2024

From Newsgroup: comp.arch

On 5/15/24 3:30 PM, BGB wrote:

In my case, I hadn't found much information about accessing SDcards in anything other than SPI mode, but I guess if it turns out it is
basically just QSPI and the same byte-oriented protocol as before, that would be useful to know.

The SD card specification covers both in detail.

But the 4 bit wide SD access will depend a lot on what hardware support
you have. I used an ARM (in a Teensy 3.2) to do the job. Mostly quite
similar to SPI access.
--
http://davesrocketworks.com
David Schultz

--- Synchronet 3.20a-Linux NewsLink 1.114

From BGB@[email protected] to comp.arch on Wed May 15 18:48:44 2024

From Newsgroup: comp.arch

On 5/15/2024 5:35 PM, David Schultz wrote:

On 5/15/24 3:30 PM, BGB wrote:

In my case, I hadn't found much information about accessing SDcards in
anything other than SPI mode, but I guess if it turns out it is
basically just QSPI and the same byte-oriented protocol as before,
that would be useful to know.

The SD card specification covers both in detail.

When looking around before, I generally only found people talking about
the SPI interfaces. Granted, I didn't have the official specifications
(which seemed to be paywalled and not generally available otherwise). I
mostly implemented stuff based on information I found on various websites.

Though, looking around some more, apparently the 4-bit interface is
basically the same protocol (at the byte level) as the 1-bit SPI
protocol, but differs mostly in that it is 4-bit SDR (apparently the DDR
variants are specific to UHS-I / UHS-II, and the "Full Speed" variant
is SDR).

Apparently, it is also functionally equivalent to QSPI in most other
regards (well, apart from QSPI RAM/Flash and SDcards having different
communication protocols).

But the 4 bit wide SD access will depend a lot on what hardware support
you have. I used an ARM (in a Teensy 3.2) to do the job. Mostly quite similar to SPI access.

I am mostly using FPGA's here, so in theory shouldn't be too much effort
to add QSPI support.

Bigger challenge is figuring out how to best modify the design of the
SPI MMIO interface to be able to make use of the higher data transfer
speeds.

Though most immediate solution would probably just be to add some more
MMIO registers to increase the data-transfer size; though this quickly
turns into diminishing returns.

Another possibility being to widen the MMIO Bus interface to 128 bits.

Mostly, the limiting factor is that it takes roughly 24 clock cycles for
every access to the MMIO bus in my case.

Though, it looks like increasing the transfer size to 32 bytes (via adding
more MMIO registers and a few more control bits) would raise the
bottleneck to around 14MB/s, which looks like mostly enough to make
effective use of SDcard 4-bit / Full Speed mode...
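
For a feel of how the ~24-cycle MMIO latency turns into a bandwidth ceiling, here is a back-of-the-envelope Python sketch. The 50 MHz bus clock and the number of MMIO accesses per transfer are guesses chosen for illustration; neither is stated above, so the outputs are not meant to reproduce the quoted figures exactly.

```python
def mmio_throughput_mb_s(bytes_per_transfer, mmio_accesses,
                         clock_hz=50_000_000, cycles_per_access=24):
    """Upper bound on throughput when every MMIO access costs a fixed latency.

    mmio_accesses counts all register touches needed per transfer
    (control write, status polls, data reads/writes). Hypothetical model.
    """
    seconds_per_transfer = mmio_accesses * cycles_per_access / clock_hz
    return bytes_per_transfer / seconds_per_transfer / 1e6


# Guessed access counts: one control write, one or two status polls,
# plus one data access per 64-bit register.
print(f"{mmio_throughput_mb_s(1, 3):.2f} MB/s")   # 8-bit transfers
print(f"{mmio_throughput_mb_s(8, 4):.2f} MB/s")   # 64-bit transfers
print(f"{mmio_throughput_mb_s(32, 5):.2f} MB/s")  # 32-byte transfers
```

The point of the model is only that the fixed per-access cost dominates, so moving more bytes per control/status round trip raises the ceiling roughly in proportion.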

--- Synchronet 3.20a-Linux NewsLink 1.114

From David Schultz@[email protected] to comp.arch on Wed May 15 19:49:34 2024

From Newsgroup: comp.arch

On 5/15/24 6:48 PM, BGB wrote:

When looking around before, I generally only found people talking about
the SPI interfaces. Granted, I didn't have the official specifications (which seemed to be paywalled and not generally available otherwise). I mostly implemented stuff based on information I found on various websites.

Parts of it are/were paywalled but the interesting bit hasn't been. At
least as far back as SD 1.0.

https://www.sdcard.org/downloads/pls/

I am not sure which one of those is the one to look at. Probably the one
with the 9.1 revision level.
--
http://davesrocketworks.com
David Schultz

--- Synchronet 3.20a-Linux NewsLink 1.114

From EricP@[email protected] to comp.arch on Thu May 16 00:21:28 2024

From Newsgroup: comp.arch

Michael S wrote:

We are talking about 1978 here. Back then, it was a 7-bit row address
and a 7-bit column address. I don't know how they applied address bits
above A13. Later on there were chip select and/or output enable signals,
but circa-1978 16-Kbit DRAM had neither of those. So, it seems, if one
wanted to put more than 16 KB on an 8-bit bus, one had to generate
multiple sets of RAS# and CAS# signals.

Yes. I'm looking at a 1979 Motorola Memory book and both the
16k*1 MCM4116A and 64k*1 MCM6664 were 16 pin packages without
chip select. You controlled them by gating the RAS and CAS.
At access time after CAS the output switches from high-z to valid data,
so at least you can just wire the outputs together.

The 4116A's were a pain because they required power of +12V, +5V, -5V, GND
(but so did the 8080) but the power supplies often only produce +5V
so you had to use voltage pumps for +12V and -5V, and those pumps were
really inefficient and got really hot, so now you needed a fan too.

The 6664's only required 5V and GND.

--- Synchronet 3.20a-Linux NewsLink 1.114

From BGB-Alt@[email protected] to comp.arch on Thu May 16 16:21:02 2024

From Newsgroup: comp.arch

On 5/15/2024 7:49 PM, David Schultz wrote:

On 5/15/24 6:48 PM, BGB wrote:

When looking around before, I generally only found people talking
about the SPI interfaces. Granted, I didn't have the official
specifications (which seemed to be paywalled and not generally
available otherwise). I mostly implemented stuff based on information
I found on various websites.

Parts of it are/were paywalled but the interesting bit hasn't been. At
least as far back as SD 1.0.

https://www.sdcard.org/downloads/pls/

I am not sure which one of those is the one to look at. Probably the one with the 9.1 revision level.

Yeah. Had found the 8.0 spec, which seemed to have some useful information.

Well, and some SDcard features which I don't really get; like apparently
it is possible to route PCIe over the SDcard interface, guessing
probably only some cards would support this. Like, would this be so
someone could stick an SDcard in an M.2 adapter or similar?...

But, yeah, for modifications to my MMIO interface, I went with:
Adding QDATA1..QDATA3 registers, which can increase the transfer size to
32 bytes. Technically, these are using misaligned MMIO addresses, which
are admittedly a route I have ended up going in a few cases (I had
originally only given ~ 16B of MMIO space to the device).

Say:
FFFF_F000E030: Status/Control
FFFF_F000E034: DataB (8-bit transfer)
FFFF_F000E038: DataQ (64-bit transfer)
FFFF_F000E040: Well, PS/2 Keyboard sits here...
FFFF_F000E050: PS/2 Mouse...
...

So, ended up redefining things:
FFFF_F000E038: DataQ / DataQ0 (64 or 256 bit)
FFFF_F000E039: DataQ1 (256-bit transfer)
FFFF_F000E03A: DataQ2 (256-bit transfer)
FFFF_F000E03B: DataQ3 (256-bit transfer)

The status/control register was used to check the status of SPI
transfers, and to accept commands for sending data over SPI.

Using larger transfers is partly because access to the MMIO interface
has a roughly 24 cycle latency, and using a bigger transfer size reduces
the amount of accesses that would be spent on checking the BUSY status
or writing to the Control register for bulk data.

Added bits to status/control:
QSPI: Signals to use a QSPI transfer (vs SPI)
READ: Signals a READ operation (vs SWAP or WRITE)
DDR: Signals the use of DDR vs SDR
XMIT32B: Send/Receive 32 bytes.
With a few other bits having existed:
XMIT8B: Send/receive 8 bytes.
XMIT: Send/Receive 1 byte.
BUSY: SPI is still busy.

The high-order bits of the Status/Control register are used to encode a
clock division relative to the base-clock.

In this case, say:
XMITxx by itself will try to send data through CMD and DATA0 pins.
Using DATA3 as the CS pin.
XMITxx+READ: Will try to read data via the CMD pin;
XMITxx+QSPI: Will send data via DATA0..DATA3;
XMITxx+QSPI+READ: Will receive data via DATA0..DATA3;
...

Note that the routing of the signals to the SDcard pins is via glue
logic external to the module itself (its output signals are more or less
plain QSPI).

Base XMITxx would be used for SPI mode, and for writing commands.
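
The flag combinations above could be modeled roughly like this in Python. The flag names follow the list above, but the bit positions are placeholders (the actual positions are not given), and `data_path` is a made-up helper that only restates the routing rules.

```python
from enum import IntFlag


class SpiCtrl(IntFlag):
    # Names follow the post; bit positions are hypothetical placeholders.
    XMIT = 1 << 0      # send/receive 1 byte
    XMIT8B = 1 << 1    # send/receive 8 bytes
    XMIT32B = 1 << 2   # send/receive 32 bytes
    BUSY = 1 << 3      # SPI is still busy
    QSPI = 1 << 4      # QSPI transfer (vs SPI)
    READ = 1 << 5      # READ operation (vs SWAP or WRITE)
    DDR = 1 << 6       # DDR vs SDR


XMIT_ANY = SpiCtrl.XMIT | SpiCtrl.XMIT8B | SpiCtrl.XMIT32B


def data_path(ctrl):
    """Describe which pins carry data for a given control word."""
    if not ctrl & XMIT_ANY:
        return "no transfer requested"
    if ctrl & SpiCtrl.QSPI:
        if ctrl & SpiCtrl.READ:
            return "receive via DATA0..DATA3"
        return "send via DATA0..DATA3"
    if ctrl & SpiCtrl.READ:
        return "read via CMD"
    return "send via CMD and DATA0"


print(data_path(SpiCtrl.XMIT))
print(data_path(SpiCtrl.XMIT32B | SpiCtrl.QSPI | SpiCtrl.READ))
```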

Seems I was wrong about something before:
The combination of CMD0+CS is what selects SPI mode (for normal SD mode,
the CS signal would not be asserted).
I was mistaken in thinking that it was the stream of FF bytes and then asserting CS near the end (say, the init procedure for the SDcard
involving sending a large number of FF bytes at a fairly low speed, then sending some other commands and boosting the speed to the intended
operating speed).

--- Synchronet 3.20a-Linux NewsLink 1.114

From David Schultz@[email protected] to comp.arch on Thu May 16 18:05:03 2024

From Newsgroup: comp.arch

On 5/16/24 4:21 PM, BGB-Alt wrote:

Seems I was wrong about something before:
The combination of CMD0+CS is what selects SPI mode (for normal SD mode,
the CS signal would not be asserted).
I was mistaken in thinking that it was the stream of FF bytes and then asserting CS near the end (say, the init procedure for the SDcard
involving sending a large number of FF bytes at a fairly low speed, then sending some other commands and boosting the speed to the intended
operating speed).

The low speed was a holdover from its previous life as the MMC card
standard. (I still have a 64MB MMC card.) MMC cards were intended to be a
multi-drop solution. As such, some of the signal lines were
open-drain/collector, so the speed was limited by the pullup resistor and
parasitic capacitance.
--
http://davesrocketworks.com
David Schultz

--- Synchronet 3.20a-Linux NewsLink 1.114

From BGB@[email protected] to comp.arch on Thu May 16 21:10:05 2024

From Newsgroup: comp.arch

On 5/16/2024 6:05 PM, David Schultz wrote:

On 5/16/24 4:21 PM, BGB-Alt wrote:

Seems I was wrong about something before:
The combination of CMD0+CS is what selects SPI mode (for normal SD
mode, the CS signal would not be asserted).
I was mistaken in thinking that it was the stream of FF bytes and then
asserting CS near the end (say, the init procedure for the SDcard
involving sending a large number of FF bytes at a fairly low speed,
then sending some other commands and boosting the speed to the
intended operating speed).

The low speed was a holdover from its previous life as the MMC card standard. (I still have a 64MB MMC card.) Which were intended to be a multi-drop solution. As such some of the signal lines were open-drain/collector so the speed was limited by the pullup resistor and parasitic capacitance.

Yeah.

Set to 400kHz, send ~ 8K or so worth of FF bytes, send CMD0, etc, then
boost up to 12.5 MHz or similar.
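
For reference, the CMD0 sent in that sequence is a 6-byte frame: a start/transmission byte holding the command index, a 32-bit argument, and a CRC7 followed by a stop bit. Here is a minimal Python sketch of building it; the function names are my own, and actually clocking the bytes out (and handling CS) is left out.

```python
def crc7(data):
    """CRC7 with polynomial x^7 + x^3 + 1, as used for SD command frames."""
    crc = 0
    for byte in data:
        for shift in range(7, -1, -1):
            bit = (byte >> shift) & 1
            msb = (crc >> 6) & 1
            crc = (crc << 1) & 0x7F
            if bit ^ msb:
                crc ^= 0x09
    return crc


def sd_command_frame(index, argument=0):
    """Build a 6-byte SD command: 0x40|index, 32-bit argument, CRC7 + stop bit."""
    body = bytes([0x40 | (index & 0x3F)]) + argument.to_bytes(4, "big")
    return body + bytes([(crc7(body) << 1) | 1])


print(sd_command_frame(0).hex())  # 400000000095
```

CMD0 is a handy self-test for the CRC routine, since its frame is the well-known constant 40 00 00 00 00 95.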

Originally, I was using 5 MHz, but did speed things up. I ended up using
LZ compression on lots of stuff, as 5MHz was fairly slow (and LZ
compression made it faster).

Initially, going beyond 5MHz wasn't done also as this was basically the
same as the bottleneck for the original MMIO interface (so I went to a
higher speed around the same time that I widened transfers on the MMIO interface from 8 bits to 64 bits).

Theoretically, stuff should work at 25MHz, but generally I had not found operation much over 12.5 MHz to be reliable, 16.7 MHz tended to be
unreliable, and 25 MHz didn't work.

Divider logic also limits things, say:
50MHz base clock allows for:
25MHz, 16.7MHz, 12.5MHz, 10MHz, 8.3MHz, ...
DDR operation will not currently be possible over 12.5 MHz.

Here, we need at least 4 cycles for the state-machine logic to do its
thing for DDR to work.
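
The divider arithmetic is simple enough to sketch in Python. This assumes a plain integer divider on the 50 MHz base clock; the function names are illustrative.

```python
def divided_clocks_mhz(base_mhz=50.0, max_div=6):
    """List (divider, resulting SPI clock in MHz) for integer dividers >= 2."""
    return [(div, base_mhz / div) for div in range(2, max_div + 1)]


def max_ddr_clock_mhz(base_mhz=50.0, min_cycles=4):
    """Fastest SPI clock if the state machine needs min_cycles base clocks."""
    return base_mhz / min_cycles


for div, mhz in divided_clocks_mhz():
    print(f"/{div}: {mhz:.1f} MHz")
print(f"DDR limit: {max_ddr_clock_mhz():.1f} MHz")
```

With the defaults this lists 25.0, 16.7, 12.5, 10.0 and 8.3 MHz, and a DDR limit of 12.5 MHz, matching the figures above.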

Granted, there was some uncertainty as I also tended to use a microSD to full-size SD adapter cable, using the full-size SD cards for testing
(where potentially the adapter cable could be reducing what clock-speeds
are usable). These adaptors use a small PCB that plugs into the microSD
slot, a piece of flat-flex ribbon cable extending from this PCB (~ 8
inches long), and a full-size SDcard interface on the other end, with in
this case the flat-flex seemingly soldered directly to the PCBs.

Mostly this was because microSD cards are too small to handle
effectively (say, if you drop a full size SDcard, it doesn't just
disappear in the carpet or similar).

But, the FPGA boards generally only come with MicroSD slots for whatever reason.

Mostly using 16GB UHS-I cards, as IIRC they were fairly affordable on
Amazon at the time (and didn't need much bigger for my projects).

Originally I went for SPI as it was free to use, but the native
("Default Speed" / SDR) mode should now be OK as any patents on this
will have presumably expired (and the 4x faster speeds could be worthwhile).

--- Synchronet 3.20a-Linux NewsLink 1.114

From BGB-Alt@[email protected] to comp.arch on Sat May 18 15:38:30 2024

From Newsgroup: comp.arch

On 5/15/2024 4:41 PM, MitchAlsup1 wrote:

BGB wrote:

Seems like probably in a similar area as QSPI RAM.

Quick skim, looks like QSPI RAM access looks something like:
   Pull CS low;
   Send command byte;
   Send address bytes (4);
   Send/receive data bytes;
   CS goes high when transfer is done;
     CS going high apparently puts the chip back in its idle state.

If you do a 16-byte burst, this would be ~ 1.4 cycles (DDR) per data
byte, or 2.8 cycles if driving it from a faster SDR clock. A datasheet
for a random QSPI RAM chip I found suggests it has a maximum operating
frequency of around 54 MHz (so, a little lower than the DDR chips),
and a lot are apparently "pseudo static" (they are DRAM internally,
but also perform their own RAM refresh, appearing as SRAM from the POV
of the external bus interface).

You are forgetting that DRAM RAS occurs after the first 2 address bytes
are latched, and that CAS occurs after the second 2 address bytes are
latched {and that you are in a deRAS deCAS state already.}

QSPI at 54 MHz is just under 20ns per DRAM address/command event, not that
much different than current DDRs

But what current DDRs can do is to partially overlap address/command with data transfer--I would suspect QSPI could do this too.

I suspect QSPI could not do this, as it does not have separate pins for command and data.

IOW:
CS, CLK, D0..D3

It seems to be using linear addresses rather than row/column
(so, unlike DDRx), and 32-bit addresses even though most of the
chips could probably get by with 24 (maybe it was "less bad"
to waste an extra byte on the address than to have separate versions of
the messaging protocol, or separate 24/32-bit command variants?...).

Though, SDcard has separate command and data pins in SD mode, I guess
this leaves open whether one can send commands in the middle of a block read/write. Probably not worth it though as with 512 byte blocks, it
probably wouldn't make much difference (unless the SDcard had a fairly
large internal access latency or something).

...

--- Synchronet 3.20a-Linux NewsLink 1.114
