There's also the realization that computer memory except for a few
specialized Forth chips is always made from RAM. So ideological
devotion to a pure stack VM seems to pass up perfectly good hardware
capabilities.
Gforth does support address-like locals if you want to use them.
With competent Forth compilers, the machine code is 1) the same when
using stack operations, when using the return stack, or when using
locals
If you want to use a language that is "ideologically devoted" to the
architecture, maybe you shouldn't use Forth at all - and stick with C.
I know there are situations when there are six values on the data stack
and four on the return stack which leave you with few other options. But
you can always use vanilla variables or an extra stack (which is trivial
to implement) to remedy that.
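A minimal sketch of such an extra stack in standard Forth. The names XSTACK, XSP, >X, X> and X@ and the 16-cell capacity are illustrative assumptions, not words from any particular system, and there is no overflow or underflow checking.

```forth
\ Hypothetical auxiliary stack; all names here are made up for illustration.
16 constant xdepth                   \ assumed capacity in cells
create xstack  xdepth cells allot    \ storage for the extra stack
variable xsp   xstack xsp !          \ XSP points at the next free cell

: >x ( x -- )  xsp @ !  1 cells xsp +! ;       \ push onto the extra stack
: x> ( -- x )  -1 cells xsp +!  xsp @ @ ;      \ pop from the extra stack
: x@ ( -- x )  xsp @ 1 cells - @ ;             \ copy the top item
```

Used like the return stack (`>x ... x>`), but without the restrictions that `>R` and `R>` have inside DO loops.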
Using Forth means being resourceful. Not to choose the most convenient
and lazy solution imaginable.
I don't see anything about C that is closer to the hardware than Forth
is, and I think that both languages are about equally '"ideologically devoted" to the architecture'. In particular, a C local variable is
no closer to a register (the most efficient hardware feature for
storing data) than a stack item or return stack item is, and register allocation of any of the three is similarly difficult...
[email protected] (Anton Ertl) writes:
With competent Forth compilers, the machine code is 1) the same when
using stack operations, when using the return stack, or when using
locals
"Competent Forth compilers" there describes what by Forth standards
would be called quite fancy optimizing compilers ("analytic compilers").
They are a significant technical feat and there aren't that many of
them. Traditionally Forth has been implemented as simple interpreters.
r 1->0 third 1->2 >l >l 1->1 dup 1->1 mov -$08[r14],r13 mov r15,$10[r10] >l 1->1 mov [r10],r13
2->1 add r14,$08 mov rax,rbp mov rbx,[r14] mov -$08[r10],r15 mov rax,[rbx] lea rbp,-$08[rbp] add r14,$08
In that case, a pure stack VM seems to ignore capabilities of the
underlying hardware, particularly the fact that the stack's memory is
actually RAM.
Doesn't PICK go back to the earliest days of Forth, as a way
to bypass the limitation?
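For illustration, this is what PICK buys you: a copy of an interior stack item without ROT or return-stack shuffling. PICK is in the standard CORE EXT word set (0 PICK is DUP, 1 PICK is OVER); the word names below are made up for this sketch.

```forth
\ Copy the third stack item to the top without disturbing the others.
: third-item ( x1 x2 x3 -- x1 x2 x3 x1 )  2 pick ;

\ The same effect using only shuffling via the return stack, for comparison.
: third-item-juggled ( x1 x2 x3 -- x1 x2 x3 x1 )  >r over r> swap ;
```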
[email protected] (Anton Ertl) writes:
I don't see anything about C that is closer to the hardware than Forth
is, and I think that both languages are about equally '"ideologically
devoted" to the architecture'. In particular, a C local variable is
no closer to a register (the most efficient hardware feature for
storing data) than a stack item or return stack item is, and register
allocation of any of the three is similarly difficult...
I believe early C compilers didn't attempt much if any register
allocation. You could say "register int x" to manually assign a
register to x if one was available. You were limited to 2 or 3 of those
on the PDP-11. Local variables in C otherwise lived in the stack. The
difference was that the C compiler generated straightforward assembly
code to access those variables even when they were in the stack
interior. You didn't have to use ROT or juggle stuff to the R stack to
get to the inner elements.
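To make the contrast concrete in Forth terms, here is a sketch of the same small computation written once with stack shuffling and once with named locals, which is roughly the kind of access a C compiler gives you without asking. The word names are hypothetical; `{: ... :}` is the Forth-2012 locals syntax.

```forth
\ Discriminant b^2 - 4ac, stack-only version.
: disc-stack ( a b c -- d )
  rot * 4 *        \ b 4ac
  swap dup *       \ 4ac b^2
  swap - ;         \ b^2-4ac

\ Same computation with locals: every operand is reachable by name.
: disc-locals {: a b c -- d :}
  b b *  4 a * c *  - ;
```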
In assembler, you could also program in a stack-oriented style yet
straightforwardly access the inner elements. Forth for whatever reason
chose strict stack discipline (with some loopholes like PICK). I
understand wanting to stay with purity of a model, but a more
hardware-sympathetic model would have been "stack implemented in RAM".
So I still don't understand the benefit of the "pure abstract stack"
approach, other than for a few weird special CPUs.
              locals
          with     without   ratio
max       3.56us    2.69us   1.32
strcmp   83.20us   70.50us   1.18
- anton
String handling and move operation are the exception, because
they are both simpler and faster in low level.
Simpler is the argument (especially for i86).
Faster is the bonus.
Hans Bezemer <[email protected]> writes:
I don't see anything about C that is closer to the hardware than Forth
is, and I think that both languages are about equally '"ideologically devoted" to the architecture'. In particular, a C local variable is
no closer to a register (the most efficient hardware feature for
storing data) than a stack item or return stack item is, and register allocation of any of the three is similarly difficult (with big
differences in difficulty between solutions that provide some register allocation to those that are so reliable that you usually count on
them).
Using Forth means being resourceful. Not to choose the most convenient
and lazy solution imaginable.
According to <https://www.dictionary.com/browse/resourceful>:
|able to deal skillfully and promptly with new situations,
|difficulties, etc.
Forth systems that do not implement locals are not a new situation.
So do you mean to say that it is a difficulty?
But blaming the programmer for the system implementor's failings is a
tactic used widely by system implementors (in the C world as well as
in the Forth world).
(..) and they often find some arguments that appeal to
elitism (i.e., only the chosen ones can use this programming language
for the elite as it should be used, and the others should program in
Python or "should never have been allowed to touch a keyboard" (Ulrich Drepper)).
In any case, why should it be better to use an inconvenient solution
that requires more work rather than a convenient solution that
requires less work (i.e., is lazy)?
For me, the virtues in programming are to produce correct code and to
produce it quickly; the code should use resources economically (which
does not mean that saving a few bytes on a machine with GBs of memory
is virtuous), and it should be readable and maintainable.
[email protected] writes:
String handling and move operation are the exception, because
they are both simpler and faster in low level.
Simpler is the argument (especially for i86).
Faster is the bonus.
In other words, Forth without locals is not well suited for words
that have so much active data. That is also reflected in hardware
designed for Forth, which got additional registers like A or B (or
additional capabilities for the top of the return stack register R),
which make it simpler and faster to implement such words.
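As a rough software model of what such an address register buys you, here is a sketch that emulates an A register with a variable. The word names A-REG, A!, C@A+ and SUM-CHARS are made up for this example (real Forth chips have their own instruction names and semantics); the point is only that the cursor no longer occupies a data-stack slot.

```forth
variable a-reg                      \ stand-in for a hardware address register

: a!   ( addr -- )  a-reg ! ;       \ load the address register
: c@a+ ( -- c )                     \ fetch the char at A, then advance A
  a-reg @ c@  1 chars a-reg +! ;

\ Sum the character codes of a string; only the running sum stays on the stack.
: sum-chars ( addr u -- n )
  swap a!  0 swap 0 ?do  c@a+ +  loop ;
```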
A definition of STRCMP in the paper is
: strcmp { addr1 u1 addr2 u2 -- n }
  addr1 addr2
  u1 u2 min 0
  ?do { s1 s2 }
    s1 c@ s2 c@ - ?dup
    if
      unloop exit
    then
    s1 char+ s2 char+
  loop
  2drop
  u1 u2 - ;
So in the loop we have a loop count (on the return stack), two cursors
(s1 and s2) into the compared strings, and within the loop body we additionally have the two characters, for a total of five live values,
three of which survive across iterations and are changed in every
iteration. One could implement it as
\ untested, and the following versions, too
: strcmp { addr1 u1 addr2 u2 -- n }
  u1 u2 min 0
  ?do
    addr1 i + c@ addr2 i + c@ - ?dup
    if
      unloop exit
    then
  loop
  u1 u2 - ;
where only one of the values changes in each iteration, but now the ?DO...LOOP cannot be replaced with a version that does not store a
second value but counts down (or up) to 0, so now we have a total of 6
live values, four of which survive across iterations, and one is
changed on every iteration.
One can reduce this by one value by keeping one of the cursors in the
loop counter:
: strcmp {: addr1 u1 addr2 u2 -- n :}
  addr2 addr1 - {: offset :}
  u1 u2 min addr1 + addr1 ?do
    i c@ i offset + c@ - ?dup
    if
      unloop exit
    then
  loop
  u1 u2 - ;
So now we have five live values in the body of the loop at the same
time, three of which live across iterations, and one of which changes
in each iteration. Keeping the loop parameters separate significantly lessens the load on the data stack.
Let's see if we can eliminate the local from the loop body:
: strcmp {: addr1 u1 addr2 u2 -- n :}
  addr2 addr1 - ( offset )
  u1 u2 min addr1 + addr1 ?do ( offset )
    i c@ over i + c@ - ?dup \ c1-c2, same sign as the versions above
    if
      nip unloop exit
    then
  loop
  drop u1 u2 - ;
That leaves stack purists with the task of eliminating the locals from
the prologue and epilogue of this word. Two items have to be stored
across the loop, or the difference could be computed speculatively and
only one item stored across the loop. And the computations before the
loop involve four values alive at the same time (fortunately addr2
does not live long). Let's see:
: strcmp ( addr1 u1 addr2 u2 -- n )
  rot 2dup - >r ( addr1 addr2 u2 u1 R: n1 )
  min -rot over - ( u12 addr1 offset R: n1 )
  swap rot bounds ( offset limit start R: n1 )
  ?do ( offset R: n1 loop-sys )
    i c@ over i + c@ - ?dup \ c1-c2
    if
      nip unloop r> drop exit
    then
  loop
  drop r> negate ;
As can be seen by the many stack comments, the stack load here is more
than I can easily deal with.
Maybe a stack purist can improve on that. But can he improve it
enough to make it as easy to understand as any of the versions with
locals?
- anton
On Sat, 25 Apr 2026 10:22:16 GMT
[email protected] (Anton Ertl) wrote:
Maybe a stack purist can improve on that. But can he improve it
enough to make it as easy to understand as any of the versions with
locals?
I recently reviewed the string comparison for search-wordlist
and came up with the following.
The string stored in the word header is already uppercased,
so the string comparison is case-insensitive:
: UC ( c -- c' ) \ uppercase char
  dup $61 $7B within $20 and - ;
: NCOMP4 ( addr n addr' n' - f) \ 0 is match
  dup >r
  begin
    rot = while \ str cstr
    r> dup 1- >r
  while \ str cstr
    swap count uc \ cstr str' s1
    rot count \ str' s1 cstr' c1
  repeat
    2drop r> drop 0 exit
  then
  2drop r> drop 1 ;
In the first iteration of the loop it does not compare chars but the lengths!
BR
Peter