• locals (was: Coroutines in Forth)

    From anton@[email protected] (Anton Ertl) to comp.lang.forth on Sat Apr 25 04:47:12 2026
    From Newsgroup: comp.lang.forth

    Paul Rubin <[email protected]d> writes:
    There's also the realization that computer memory except for a few
    specialized Forth chips is always made from RAM. So ideological
    devotion to a pure stack VM seems to pass up perfectly good hardware
    capabilities.

    With competent Forth compilers, the machine code is 1) the same when
    using stack operations, when using the return stack, or when using
    locals, and 2) no RAM access happens (unless the compiler runs out
    of registers). This is demonstrated by lxf on the 3DUP variants
    <[email protected]>; to spare you having to look this posting up,
    here's the relevant part:

    |: 3dup.1 ( a b c -- a b c a b c ) >r 2dup r@ -rot r> ;
    |: 3dup.2 ( a b c -- a b c a b c ) 2 pick 2 pick 2 pick ;
    |: 3dup.3 {: a b c :} a b c a b c ;
    |: 3dup.4 ( a b c -- a b c a b c ) dup 2over rot ;
    |
    |These four ways of expressing 3DUP are all compiled to exactly the
    |same code by lxf/ntf:
    |
    | 804FC0A 8B4500 mov eax , [ebp]
    | 804FC0D 8945F4 mov [ebp-Ch] , eax
    | 804FC10 8B4504 mov eax , [ebp+4h]
    | 804FC13 8945F8 mov [ebp-8h] , eax
    | 804FC16 895DFC mov [ebp-4h] , ebx
    | 804FC19 8D6DF4 lea ebp , [ebp-Ch]
    | 804FC1C C3 ret near
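
    That the four variants have the same stack effect can be checked
    on any system with a few lines. This is a minimal sketch; the
    helper name EXPECT-123123 is made up for this example, and it
    assumes a system that provides the {: ... :} locals syntax.

    ```forth
    : 3dup.1 ( a b c -- a b c a b c ) >r 2dup r@ -rot r> ;
    : 3dup.2 ( a b c -- a b c a b c ) 2 pick 2 pick 2 pick ;
    : 3dup.3 {: a b c :} a b c a b c ;
    : 3dup.4 ( a b c -- a b c a b c ) dup 2over rot ;

    \ Hypothetical helper: consume six items and abort unless they
    \ are 1 2 3 1 2 3 (checked from the top of the stack downwards).
    : expect-123123 ( x1 x2 x3 x4 x5 x6 -- )
      3 <> >r  2 <> >r  1 <> >r  3 <> >r  2 <> >r  1 <>
      r> or r> or r> or r> or r> or
      abort" 3DUP variant disagrees" ;

    1 2 3 3dup.1 expect-123123
    1 2 3 3dup.2 expect-123123
    1 2 3 3dup.3 expect-123123
    1 2 3 3dup.4 expect-123123
    ```

    This only checks behaviour; whether the generated machine code is
    also identical depends on the compiler.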

    That leads to the questions in this discussion:

    1) Should we optimize for less competent compilers? Why?

    a) If yes, should we optimize all code, or only the part of the
    code that is actually executed frequently?

    2) Are there other criteria for deciding between the alternatives?
    Which ones?

    Gforth does support address-like locals if you want to use them.

    Gforth has provided variable-flavoured locals since I implemented
    locals (in 1994), because I had the idea that using ! is preferable to
    using TO, but in practice I did not use variable-flavoured locals, and
    instead preferred to avoid TO by defining locals where their value is
    known, and then just using them (possibly defining additional locals
    instead of using TO on existing locals). And AFAIK others have rarely
    used variable-flavoured locals, either.
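
    The three styles can be put side by side. This is a sketch that
    assumes Gforth's W^ type specifier for variable-flavoured locals
    and a system that accepts a second {: ... :} in one definition (as
    Gforth does); the BUMP-* names are made up for this example.

    ```forth
    \ Variable-flavoured local: V pushes its address, so ! @ +! apply.
    : bump-var ( n -- n+1 ) {: w^ v :}  1 v +!  v @ ;

    \ Value-flavoured local, updated with TO.
    : bump-to ( n -- n+1 ) {: v :}  v 1+ to v  v ;

    \ Avoiding TO: define a new local once the value is known.
    : bump-new ( n -- n+1 ) {: v :}  v 1+ {: v1 :}  v1 ;
    ```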

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2025 proceedings: http://www.euroforth.org/ef25/papers/
    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From Paul Rubin@[email protected] to comp.lang.forth on Fri Apr 24 23:21:28 2026
    From Newsgroup: comp.lang.forth

    [email protected] (Anton Ertl) writes:
    With competent Forth compilers, the machine code is 1) the same when
    using stack operations, when using the return stack, or when using
    locals

    "Competent Forth compilers" there describes what by Forth standards
    would be called quite fancy optimizing compilers ("analytic compilers").
    They are a significant technical feat and there aren't that many of
    them. Traditionally Forth has been implemented as simple interpreters.

    In that case, a pure stack VM seems to ignore capabilities of the
    underlying hardware. Particularly, the stack's memory actually
    being RAM. Doesn't PICK go back to the earliest days of Forth, as a way
    to bypass the limitation?
  • From anton@[email protected] (Anton Ertl) to comp.lang.forth on Sat Apr 25 05:26:47 2026
    From Newsgroup: comp.lang.forth

    Hans Bezemer <[email protected]> writes:
    If you want to use a language that is "ideologically devoted" to the
    architecture, maybe you shouldn't use Forth at all - and stick with C.

    I don't see anything about C that is closer to the hardware than Forth
    is, and I think that both languages are about equally '"ideologically
    devoted" to the architecture'. In particular, a C local variable is
    no closer to a register (the most efficient hardware feature for
    storing data) than a stack item or return stack item is, and register allocation of any of the three is similarly difficult (with big
    differences in difficulty between solutions that provide some register allocation to those that are so reliable that you usually count on
    them).

    Given the stuff I read about Chuck Moore's goals in designing Forth
    and what I read about the development of BCPL, B, and C, it's not too surprising that they are close to the hardware of the time when they
    were designed. It is interesting that both Forth and C standards
    (and, to some extent, implementations) have not reflected newer
    architectural features such as SIMD instructions. At least they
    managed to reflect different machine-word sizes (BLISS didn't,
    resulting in differences between BLISS-10, BLISS-11, and BLISS-32, and
    its losing against C despite having superior compilers for more than a
    decade).

    I know there are situations when there are six values on the data stack
    and four on the return stack which leave you with few other options. But
    you can always use vanilla variables or an extra stack (which is trivial
    to implement) to remedy that.

    Using Forth means being resourceful. Not to choose the most convenient
    and lazy solution imaginable.

    According to <https://www.dictionary.com/browse/resourceful>:

    |able to deal skillfully and promptly with new situations,
    |difficulties, etc.

    Forth systems that do not implement locals are not a new situation.
    So do you mean to say that it is a difficulty? I would agree. That's
    fine if you are using a tiny system and do not want to use an
    umbilical/tethered system, but if the system is big enough to support
    locals, the system's lack of locals shows the laziness of the system
    implementor.

    But blaming the programmer for the system implementor's failings is a
    tactic used widely by system implementors (in the C world as well as
    in the Forth world), and they often find some arguments that appeal to
    elitism (i.e., only the chosen ones can use this programming language
    for the elite as it should be used, and the others should program in
    Python or "should never have been allowed to touch a keyboard" (Ulrich Drepper)), and enough people fall for this that they repeat such
    arguments and come up with additional arguments of this kind.

    In any case, why should it be better to use an inconvenient solution
    that requires more work rather than a convenient solution that
    requires less work (i.e., is lazy)?

    For me virtues in programming are to produce correct code, to produce
    it quickly, the code should use the resources economically (which does
    not mean that saving a few bytes on a machine with GBs of memory is
    virtuous), and the code should be readable and maintainable.

    - anton
  • From Paul Rubin@[email protected] to comp.lang.forth on Fri Apr 24 23:55:16 2026
    From Newsgroup: comp.lang.forth

    [email protected] (Anton Ertl) writes:
    I don't see anything about C that is closer to the hardware than Forth
    is, and I think that both languages are about equally '"ideologically devoted" to the architecture'. In particular, a C local variable is
    no closer to a register (the most efficient hardware feature for
    storing data) than a stack item or return stack item is, and register allocation of any of the three is similarly difficult...

    I believe early C compilers didn't attempt much if any register
    allocation. You could say "register int x" to manually assign a
    register to x if one was available. You were limited to 2 or 3 of those
    on the PDP-11. Local variables in C otherwise lived in the stack. The difference was that the C compiler generated straightforward assembly
    code to access those variables even when they were in the stack
    interior. You didn't have to use ROT or juggle stuff to the R stack to
    get to the inner elements.

    In assembler, you could also program in a stack-oriented style yet straightforwardly access the inner elements. Forth for whatever reason
    chose strict stack discipline (with some loopholes like PICK). I
    understand wanting to stay with purity of a model, but a more hardware-sympathetic model would have been "stack implemented in RAM".

    So I still don't understand the benefit of the "pure abstract stack"
    approach, other than for a few weird special CPU's.
  • From anton@[email protected] (Anton Ertl) to comp.lang.forth on Sat Apr 25 06:43:23 2026
    From Newsgroup: comp.lang.forth

    Paul Rubin <[email protected]d> writes:
    [email protected] (Anton Ertl) writes:
    With competent Forth compilers, the machine code is 1) the same when
    using stack operations, when using the return stack, or when using
    locals

    "Competent Forth compilers" there describes what by Forth standards
    would be called quite fancy optimizing compilers ("analytic compilers").
    They are a significant technical feat and there aren't that many of
    them. Traditionally Forth has been implemented as simple interpreters.

    And traditionally Forth has been implemented without locals, for the
    same reason: It takes less memory and, for the system implementor,
    less work; on current non-tiny machines, the latter aspect still
    exists, and IMO is a big motivation for anti-locals advocacy (i.e., a sour-grapes argument).

    It's a bit perverse: You argue for locals with simple implementations,
    while anti-locals advocates argue against locals with simple
    implementations.

    And because it's more work, there are fewer sophisticated than simple
    systems. But who cares how many there are? The question is what
    programmers and users use and what their goals are.

    In any case, when it comes to performance measurements on "simple
    interpreters" like the Gforth of 1994, Forth code with locals usually
    turns out to be slower and consume more memory than Forth code using
    (and trying to avoid) stack juggling. E.g., my paper [ertl94l]
    contains the following comparison:

    locals
    with without ratio
    max 3.56us 2.69us 1.32
    strcmp 83.20us 70.50us 1.18

    Numbers from a 486DX2/66, strcmp compares a string with 17 characters
    with itself.

    The explanation given is:

    |The slowdown factor of using locals is due to the execution of more |primitives (e.g., 14 instead of 12 per character in
    |"strcmp"). Originally there was also a large overhead due to fetching
    |inline arguments, resulting in slowdowns of 1.58 for "max" and 1.41
    |for "strcmp". This overhead has been eliminated mostly by using
    |versions of the primitives specialized for frequent inline arguments
    |(e.g., "8lp+!" as specialization of "lp+!#" with the inline
    |argument 8).

    @InProceedings{ertl94l,
    author = "M. Anton Ertl",
    title = "Automatic Scoping of Local Variables",
    booktitle = "EuroForth~'94 Conference Proceedings",
    year = "1994",
    address = "Winchester, UK",
    pages = "31--37",
    url = "https://www.complang.tuwien.ac.at/papers/ertl94l.ps.gz",
    abstract = "In the process of lifting the restrictions on using
    locals in Forth, an interesting problem poses
    itself: What does it mean if a local is defined in a
    control structure? Where is the local visible? Since
    the user can create every possible control structure
    in ANS Forth, the answer is not as simple as it may
    seem. Ideally, the local is visible at a place if
    the control flow {\em must} pass through the
    definition of the local to reach this place. This
    paper discusses locals in general, the visibility
    problem, its solution, the consequences and the
    implementation as well as related programming style
    questions."
    }

    It might be interesting to measure this again on current hardware with
    the current, somewhat more sophisticated, but not yet "competent"
    Gforth, and maybe I will, at some other time. However, looking at the
    code for Gforth for 3DUP.3 compared to the others, Gforth still uses
    more primitives (even with superinstructions) and more machine
    instructions. From <[email protected]>:

    : 3dup.1 ( a b c -- a b c a b c ) >r 2dup r@ -rot r> ;
    : 3dup.2 ( a b c -- a b c a b c ) 2 pick 2 pick 2 pick ;
    : 3dup.3 {: a b c :} a b c a b c ;
    : 3dup.4 ( a b c -- a b c a b c ) dup 2over rot ;

    And here's the gforth-fast code on AMD64:

    3dup.1               3dup.2               3dup.3               3dup.4
    >r 1->0              third 1->2           >l >l 1->1           dup 1->1
      mov -$08[r14],r13    mov r15,$10[r10]   >l 1->1                mov [r10],r13
      sub r14,$08        third 2->3             mov -$08[rbp],r13    sub r10,$08
    2dup 0->2              mov r9,$08[r10]      mov rdx,$08[r10]   2over 1->3
      mov r13,$10[r10]   third 3->1             mov rax,rbp          mov r15,$18[r10]
      mov r15,$08[r10]     mov [r10],r13        add r10,$10          mov r9,$10[r10]
    i 2->3                 sub r10,$18          lea rbp,-$10[rbp]  rot 3->1
      mov r9,[r14]         mov $10[r10],r15     mov -$10[rax],rdx    mov [r10],r15
    -rot 3->2              mov $08[r10],r9      mov r13,[r10]        sub r10,$10
      mov [r10],r9       ;s 1->1              >l @local0 1->1        mov $08[r10],r9
      sub r10,$08          mov rbx,[r14]      @local0 1->1         ;s 1->1
    r> 2->1                add r14,$08          mov rax,rbp          mov rbx,[r14]
      mov -$08[r10],r15    mov rax,[rbx]        lea rbp,-$08[rbp]    add r14,$08
      sub r10,$10          jmp eax              mov -$08[rax],r13    mov rax,[rbx]
      mov $10[r10],r13                        @local1 1->2           jmp eax
      mov r13,[r14]                             mov r15,$08[rbp]
      add r14,$08                             @local2 2->1
    ;s 1->1                                     mov -$08[r10],r15
      mov rbx,[r14]                             sub r10,$10
      add r14,$08                               mov $10[r10],r13
      mov rax,[rbx]                             mov r13,$10[rbp]
      jmp eax                                 @local0 1->2
                                                mov r15,$00[rbp]
                                              @local1 2->3
                                                mov r9,$08[rbp]
                                              @local2 3->1
                                                mov -$10[r10],r9
                                                sub r10,$18
                                                mov $10[r10],r15
                                                mov $18[r10],r13
                                                mov r13,$10[rbp]
                                              lit 1->2
                                                #24
                                                mov r15,$50[rbx]
                                              lp+! 2->1
                                                add rbp,r15
                                              ;s 1->1
                                                mov rbx,[r14]
                                                add r14,$08
                                                mov rax,[rbx]
                                                jmp eax

    [Note that for a superinstruction like ">l >l" or ">l @local0", all
    threaded code cells are shown, the first as superinstruction, and the
    remaining ones as the simple primitive in that threaded-code slot; but
    the other threaded-code slots have no separate code generated.]

    You seem to argue that the random-access aspect of locals provides a performance advantage on simple systems, but in most cases, code using
    locals is at a performance disadvantage on such systems (and
    traditionalists have often used that to argue against locals).

    In that case, a pure stack VM seems to ignore capabilities of the
    underlying hardware. Particularly, the stack's memory actually
    being RAM.

    Keeping at least one stack item in a register leads to a smaller and
    faster implementation, and is not more complex than keeping all the
    stack memory in RAM. It does require enough registers, however (i.e.,
    you do not use this technique on the 6502).

    Doesn't PICK go back to the earliest days of Forth, as a way
    to bypass the limitation?

    A way to use RAM that is less frowned upon by Forth traditionalists is
    (global) variables. The fact that the use of global variables is
    frowned upon in the wider programming community for various reasons
    seems to pour oil into the fire of their elitism.

    - anton
  • From anton@[email protected] (Anton Ertl) to comp.lang.forth on Sat Apr 25 08:21:41 2026
    From Newsgroup: comp.lang.forth

    Paul Rubin <[email protected]d> writes:
    I believe early C compilers didn't attempt much if any register
    allocation.

    Yes, they did not allocate auto variables (what we consider locals) to registers.

    The
    difference was that the C compiler generated straightforward assembly
    code to access those variables even when they were in the stack
    interior. You didn't have to use ROT or juggle stuff to the R stack to
    get to the inner elements.

    That's the same with unsophisticated locals implementations like those
    of Gforth (I do not mention other Forth systems with such
    implementations to protect the guilty).

    Forth for whatever reason
    chose strict stack discipline (with some loopholes like PICK). I
    understand wanting to stay with purity of a model, but a more
    hardware-sympathetic model would have been "stack implemented in RAM".

    What do you mean by that? Forth already provides PICK. ROLL and
    -ROLL are either slow to implement in RAM or require significant sophistication. In addition, Gforth has

    : stick ( x0 x1 ... xu x u -- x x1 ... xu ) \ gforth-internal
    \ replace x0 with x; e.g., 5 PICK 1+ 5 STICK increments the 6th
    \ stack element (not recommended).
    2 + cells sp@ + ! ;

    which is used in the Gforth source code 7 times (compared to 20 times
    for PICK, 4 for FOURTH, 38 for THIRD, 308 for OVER and 1128 for DUP),
    always with colon-sys-xt-offset as U, so STICK is only used to
    manipulate colon-sys control-flow stack items. I have also had little
    appetite to use it elsewhere.

    In general, in Forth programming one copies things from various places
    in stacks with DUP, OVER, PICK, and R@; sometimes you do not need the
    item in its original place any more, then you SWAP, ROT or ROLL it
    instead of keeping it on the stack and dropping it later (and the item
    might be in the way). Very occasionally, you copy an item deeper into
    the stack, as with TUCK, or -ROT or -ROLL it out of the way.

    But overwriting an existing stack item with something else as done by
    STICK is not something we tend to do, and this also shows in the
    absence of such words for the top few stack items (while 1 PICK is
    OVER, there is no word that corresponds to 1 STICK). I think the
    reason why it is not done is that we avoid keeping dead stack items
    around that we might overwrite. Such dead stack items would often be
    in the way.

    And if someone has the desire for having a storage location that they
    want to overwrite, Forth has locals (although I avoid overwriting
    them, too, see <https://net2o.de/gforth/Locals-programming-style.html>).

    So I still don't understand the benefit of the "pure abstract stack"
    approach, other than for a few weird special CPU's.

    The benefit of not implementing locals is that implementing the Forth
    system takes less time and the resulting system is smaller.

    PICK tends to be frowned upon because it is a code smell that suggests
    that you have too much going on on the stack, which makes the program
    hard to understand, and you should be looking for alternatives.

    ROLL and -ROLL are avoided for the same reason and because they are
    slow on many implementations.

    As for STICK, see above.

    - anton
  • From albert@[email protected] to comp.lang.forth on Sat Apr 25 11:27:36 2026
    From Newsgroup: comp.lang.forth

    In article <[email protected]>,
    Paul Rubin <[email protected]d> wrote:
    [email protected] (Anton Ertl) writes:
    I don't see anything about C that is closer to the hardware than Forth
    is, and I think that both languages are about equally '"ideologically
    devoted" to the architecture'. In particular, a C local variable is
    no closer to a register (the most efficient hardware feature for
    storing data) than a stack item or return stack item is, and register
    allocation of any of the three is similarly difficult...

    I believe early C compilers didn't attempt much if any register
    allocation. You could say "register int x" to manually assign a
    register to x if one was available. You were limited to 2 or 3 of those
    on the PDP-11. Local variables in C otherwise lived in the stack. The
    difference was that the C compiler generated straightforward assembly
    code to access those variables even when they were in the stack
    interior. You didn't have to use ROT or juggle stuff to the R stack to
    get to the inner elements.

    In assembler, you could also program in a stack-oriented style yet
    straightforwardly access the inner elements. Forth for whatever reason
    chose strict stack discipline (with some loopholes like PICK). I
    understand wanting to stay with purity of a model, but a more
    hardware-sympathetic model would have been "stack implemented in RAM".

    There are more loopholes, once you think of it.
    Suppose you have a recursive integration algorithm. Define an object
    that contains all relevant recursive data. Allocate it on the data
    stack ( DSP@ size - DSP! ) and make it the current object (DSP@ ^recdat !)
    Free the stack once you're done ( DSP@ size + DSP! ) .
    In this context you are using normal floats, not weird locals, and
    have your choice of normal or single floats.
    [ More politically correct is probably to ALLOCATE FREE for no
    clear benefit. ]


    So I still don't understand the benefit of the "pure abstract stack"
    approach, other than for a few weird special CPU's.


    Groetjes Albert
    --
    The Chinese government is satisfied with its military superiority over USA.
    The next 5 year plan has as primary goal to advance life expectancy
    over 80 years, like Western Europe.
  • From albert@[email protected] to comp.lang.forth on Sat Apr 25 11:43:30 2026
    From Newsgroup: comp.lang.forth

    In article <[email protected]>,
    Anton Ertl <[email protected]> wrote:
    <SNIP>

    locals
    with without ratio
    max 3.56us 2.69us 1.32
    strcmp 83.20us 70.50us 1.18

    Interestingly, I don't allow complicated definitions with assembler implementations in ciforth.
    E.g. + XOR 0< EXECUTE are all low level, not much more.
    String handling and move operation are the exception, because
    they are both simpler and faster in low level.
    Simpler is the argument (especially for i86).
    Faster is the bonus.

    <SNIP>

    - anton

    Groetjes Albert
  • From anton@[email protected] (Anton Ertl) to comp.lang.forth on Sat Apr 25 10:22:16 2026
    From Newsgroup: comp.lang.forth

    [email protected] writes:
    String handling and move operation are the exception, because
    they are both simpler and faster in low level.
    Simpler is the argument (especially for i86).
    Faster is the bonus.

    In other words, Forth without locals is not well suited for words
    that have so much active data. That is also reflected in hardware
    designed for Forth, which got additional registers like A or B (or
    additional capabilities for the top of the return stack register R),
    which make it simpler and faster to implement such words.

    A definition of STRCMP in the paper is

    : strcmp { addr1 u1 addr2 u2 -- n }
    addr1 addr2
    u1 u2 min 0
    ?do { s1 s2 }
    s1 c@ s2 c@ - ?dup
    if
    unloop exit
    then
    s1 char+ s2 char+
    loop
    2drop
    u1 u2 - ;

    So in the loop we have a loop count (on the return stack), two cursors
    (s1 and s2) into the compared strings, and within the loop body we
    additionally have the two characters, for a total of five live values,
    three of which survive across iterations and are changed in every
    iteration. One could implement it as

    \ untested, and the following versions, too
    : strcmp { addr1 u1 addr2 u2 -- n }
      u1 u2 min 0
      ?do
        addr1 i + c@ addr2 i + c@ - ?dup
        if
          unloop exit
        then
      loop
      u1 u2 - ;

    where only one of the values changes in each iteration, but now the
    ?DO...LOOP cannot be replaced with a version that does not store a
    second value but counts down (or up) to 0, so now we have a total of 6
    live values, four of which survive across iterations, and one is
    changed on every iteration.

    One can reduce this by one value by keeping one of the cursors in the
    loop counter:

    : strcmp {: addr1 u1 addr2 u2 -- n :}
    addr2 addr1 - {: offset :}
    u1 u2 min addr1 + addr1 ?do
    i c@ i offset + c@ - ?dup
    if
    unloop exit
    then
    loop
    u1 u2 - ;

    So now we have five live values in the body of the loop at the same
    time, three of which live across iterations, and one of which changes
    in each iteration. Keeping the loop parameters separate significantly
    lessens the load on the data stack.

    Let's see if we can eliminate the local from the loop body:

    : strcmp {: addr1 u1 addr2 u2 -- n :}
      addr2 addr1 - ( offset )
      u1 u2 min addr1 + addr1 ?do ( offset )
        i c@ over i + c@ - ?dup
        if
          nip unloop exit
        then
      loop
      drop u1 u2 - ;

    That leaves stack purists with the task of eliminating the locals from
    the prologue and epilogue of this word. Two items have to be stored
    across the loop, or the difference could be computed speculatively and
    only one item stored across the loop. And the computations before the
    loop involve four values alive at the same time (fortunately addr2 is
    not live long). Let's see:

    : strcmp ( addr1 u1 addr2 u2 -- n )
      rot 2dup - >r      ( addr1 addr2 u2 u1 R: n1 )
      min -rot over -    ( u12 addr1 offset R: n1 )
      swap rot bounds    ( offset limit start R: n1 )
      ?do                ( offset R: n1 loop-sys )
        i c@ over i + c@ - ?dup
        if
          nip unloop r> drop exit
        then
      loop
      drop r> negate ;

    As can be seen by the many stack comments, the stack load here is more
    than I can easily deal with.

    Maybe a stack purist can improve on that. But can he improve it
    enough to make it as easy to understand as any of the versions with
    locals?
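
    Since these versions are marked untested, one way to gain some
    confidence is to run a locals variant and a stack-only variant on
    the same inputs and compare the results. The following is a sketch
    under assumptions: the names STRCMP-L, STRCMP-S, CHECK, BOTH and
    RUN-CHECKS are made up here, BOUNDS is defined only if the system
    lacks it, and STRCMP-L relies on the system accepting a second
    {: ... :} in one definition (Gforth does; the standard leaves this
    ambiguous).

    ```forth
    [undefined] bounds [if]
    : bounds ( addr u -- addr+u addr ) over + swap ;
    [then]

    \ Locals variant: the loop index is the cursor into string 1,
    \ and I+offset is the cursor into string 2.
    : strcmp-l {: addr1 u1 addr2 u2 -- n :}
      addr2 addr1 - {: offset :}
      u1 u2 min addr1 + addr1 ?do
        i c@  i offset + c@  - ?dup if unloop exit then
      loop
      u1 u2 - ;

    \ Stack-only variant: offset stays on the data stack and
    \ u2-u1 waits on the return stack below the loop parameters.
    : strcmp-s ( addr1 u1 addr2 u2 -- n )
      rot 2dup - >r  min -rot over -  swap rot bounds
      ?do
        i c@  over i + c@  - ?dup if nip unloop r> drop exit then
      loop
      drop r> negate ;

    : check ( n1 n2 -- ) <> abort" strcmp variants disagree" ;

    \ Run both variants on the same pair of strings and compare.
    : both ( addr1 u1 addr2 u2 -- )
      2over 2over strcmp-l >r  strcmp-s r> check ;

    : run-checks ( -- )
      s" abc" s" abc" both
      s" abc" s" abd" both
      s" abd" s" abc" both
      s" ab"  s" abc" both
      s" "    s" x"   both ;

    run-checks
    ```

    Agreement between the two variants does not prove either correct,
    but it catches sign and stack-depth slips of the kind that are
    easy to make when juggling this many values.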

    - anton
  • From Hans Bezemer@[email protected] to comp.lang.forth on Sat Apr 25 15:43:06 2026
    From Newsgroup: comp.lang.forth

    On 25-04-2026 07:26, Anton Ertl wrote:
    Hans Bezemer <[email protected]> writes:

    I don't see anything about C that is closer to the hardware than Forth
    is, and I think that both languages are about equally '"ideologically devoted" to the architecture'. In particular, a C local variable is
    no closer to a register (the most efficient hardware feature for
    storing data) than a stack item or return stack item is, and register allocation of any of the three is similarly difficult (with big
    differences in difficulty between solutions that provide some register allocation to those that are so reliable that you usually count on
    them).

    Well, you're actually shooting at Paul Rubin - not at me. Thank you! I
    take all the help I can get!

    Using Forth means being resourceful. Not to choose the most convenient
    and lazy solution imaginable.

    According to <https://www.dictionary.com/browse/resourceful>:

    |able to deal skillfully and promptly with new situations,
    |difficulties, etc.

    That's EXACTLY what I meant!

    Forth systems that do not implement locals are not a new situation.
    So do you mean to say that it is a difficulty?

    You're completely beside the point I wanted to make. I meant the design
    or algorithm one has to implement.

    But blaming the programmer for the system implementor's failings is a
    tactic used widely by system implementors (in the C world as well as
    in the Forth world).

    YAGNI is not a "system implementer's failing". It is a choice he made,
    because you (a) really don't need it - or (b) if you need it you can add
    it yourself. Which all seems very Forth like.

    (..) and they often find some arguments that appeal to
    elitism (i.e., only the chosen ones can use this programming language
    for the elite as it should be used, and the others should program in
    Python or "should never have been allowed to touch a keyboard" (Ulrich Drepper).

    It's your own pal Bernd that said: "A good programmer will write even
    better code in Forth. A bad programmer will write abysmal code in Forth.
    And I'm sorry to say - but most programmers are quite bad."

    So, either you agree with him or we have an unfortunate departure of one
    of the foremost members of Gforth. Because this states - in no
    uncertain words - that Forth programmers *ARE* elite.

    Which in itself is a defensible position. I mean - we're 0.1% of the
    programming population according to TIOBE. I blame it solely on our
    inability to procreate, but you may put up some other viable explanation.

    Moore himself thinks we're elite: "I must say that I'm appalled at the
    code I see. Because all this code suffers the same failings, I conclude
    it's not a sporadic problem."

    I mean - there is nothing wrong with being a subpar programmer. Plenty
    of languages to choose from - and still get bread on the table.

    Of course, it's expected that one states that "All humans are equal -
    even if they're programming". That's the time we live in.

    But I quote Jan Cremer, a famous Dutch writer: "'I'm okay and you're
    okay.' That sounds quite nice. But 'I'm okay and you're a dick' feels
    much better."

    Humanity can be divided in four groups:
    1. Those who can not write Forth;
    2. Those who tried Forth, but failed;
    3. Those who pretend to write Forth, but still fail;
    4. Those who can write Forth.

    I mean: the truth must be said. I'm Dutch. I can't help myself.

    In any case, why should it be better to use an inconvenient solution
    that requires more work rather than a convenient solution that
    requires less work (i.e., is lazy)?

    It would be better to think deeply, find an original solution and learn.
    Like Albert with his brilliant ;: word.

    For me virtues in programming are to produce correct code, to produce
    it quickly, the code should use the resources economically (which does
    not mean that saving a few bytes on a machine with GBs of memory is
    virtuous), and the code should be readable and maintainable.

    Well, to me it's something different. Who cares what you or I think.
    It's about what you can prove decisively.

    Hans Bezemer

  • From peter@[email protected] to comp.lang.forth on Sat Apr 25 16:07:47 2026
    From Newsgroup: comp.lang.forth

    On Sat, 25 Apr 2026 10:22:16 GMT
    [email protected] (Anton Ertl) wrote:

    [email protected] writes:
    String handling and move operation are the exception, because
    they are both simpler and faster in low level.
    Simpler is the argument (especially for i86).
    Faster is the bonus.

    In other words, Forth without locals is not well suited for words
    that have so much active data. That is also reflected in hardware
    designed for Forth, which got additional registers like A or B (or
    additional capabilities for the top of the return stack register R),
    which make it simpler and faster to implement such words.

    A definition of STRCMP in the paper is

    : strcmp { addr1 u1 addr2 u2 -- n }
    addr1 addr2
    u1 u2 min 0
    ?do { s1 s2 }
    s1 c@ s2 c@ - ?dup
    if
    unloop exit
    then
    s1 char+ s2 char+
    loop
    2drop
    u1 u2 - ;

    So in the loop we have a loop count (on the return stack), two cursors
    (s1 and s2) into the compared strings, and within the loop body we
    additionally have the two characters, for a total of five live values,
    three of which survive across iterations and are changed in every
    iteration. One could implement it as

    \ untested, and the following versions, too
    : strcmp { addr1 u1 addr2 u2 -- n }
    u1 u2 min 0
    ?do
    addr1 i + c@ addr2 i + c@ - ?dup
    if
    unloop exit
    then
    loop
    u1 u2 - ;

    where only one of the values changes in each iteration, but now the
    ?DO...LOOP cannot be replaced with a version that does not store a
    second value but counts down (or up) to 0, so now we have a total of 6
    live values, four of which survive across iterations, and one is
    changed on every iteration.

    One can reduce this by one value by keeping one of the cursors in the
    loop counter:

    : strcmp {: addr1 u1 addr2 u2 -- n :}
    addr2 addr1 - {: offset :}
    u1 u2 min addr1 + addr1 ?do
    i c@ i offset + c@ - ?dup
    if
    unloop exit
    then
    loop
    u1 u2 - ;

    So now we have five live values in the body of the loop at the same
    time, three of which live across iterations, and one of which changes
    in each iteration. Keeping the loop parameters separate significantly
    lessens the load on the data stack.
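
    A minimal sketch for sanity-checking the cursor-in-the-loop-counter
    idea, assuming a Forth-2012 system with the locals word set. It is a
    variant, not the code above verbatim: offset is declared as an
    uninitialized local after | and set with TO, so the word needs only
    one locals definition. TEST-STRCMP is a name made up for this example.

    ```forth
    \ Variant of the version above: one cursor lives in the loop index I,
    \ the other is reached as I + offset.
    : strcmp {: addr1 u1 addr2 u2 | offset -- n :}
      addr2 addr1 - to offset           \ distance between the two strings
      u1 u2 min addr1 + addr1 ?do       \ I runs over the common prefix of string 1
        i c@ i offset + c@ - ?dup       \ c1 - c2, kept only if nonzero
        if unloop exit then
      loop
      u1 u2 - ;                         \ common prefix equal: lengths decide

    \ Hypothetical test word; ABORT" fires when a check fails.
    : test-strcmp ( -- )
      s" abc" s" abd" strcmp 0< 0= abort" abc should sort before abd"
      s" abd" s" abc" strcmp 0> 0= abort" abd should sort after abc"
      s" abc" s" abc" strcmp 0<>   abort" equal strings should give 0"
      s" abc" s" ab"  strcmp 0> 0= abort" a string should sort after its prefix" ;
    test-strcmp
    ```

    The same test word can be pointed at any of the other versions, since
    they are all meant to have the stack effect ( addr1 u1 addr2 u2 -- n ).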

    Let's see if we can eliminate the local from the loop body:

    : strcmp {: addr1 u1 addr2 u2 -- n :}
    addr2 addr1 - ( offset )
    u1 u2 min addr1 + addr1 ?do ( offset )
    i c@ over i + c@ - ?dup
    if
    nip unloop exit
    then
    loop
    drop u1 u2 - ;

    That leaves stack purists with the task of eliminating the locals from
    the prologue and epilogue of this word. Two items have to be stored
    across the loop, or the difference could be computed speculatively and
    only one item stored across the loop. And the computations before the
    loop involve four values alive at the same time (fortunately addr2
    does not live long). Let's see:

    : strcmp ( addr1 u1 addr2 u2 -- n )
    rot 2dup - >r ( addr1 addr2 u2 u1 R: n1 )
    min -rot over - ( u12 addr1 offset R: n1 )
    swap rot bounds ( offset limit start R: n1 )
    ?do ( offset R: n1 loop-sys )
    i c@ over i + c@ - ?dup
    if
    nip unloop r> drop exit
    then
    loop
    drop r> negate ;

    As can be seen by the many stack comments, the stack load here is more
    than I can easily deal with.
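
    With this much stack juggling a mechanical check helps. The following
    is a sketch, not a definitive version: the copy of the word takes its
    arguments from the stack (a stack comment instead of a locals
    definition) and computes the character difference as c1 - c2, so that
    its sign agrees with u1 - u2. BOUNDS is not a standard word, so it is
    defined here if the system lacks it; TEST-STRCMP-STACK is a made-up
    name.

    ```forth
    [undefined] bounds [if]
    : bounds ( addr u -- addr+u addr )  over + swap ;
    [then]

    : strcmp ( addr1 u1 addr2 u2 -- n )
      rot 2dup - >r        ( addr1 addr2 u2 u1  R: n1 )   \ n1 = u2 - u1
      min -rot over -      ( u12 addr1 offset   R: n1 )   \ offset = addr2 - addr1
      swap rot bounds      ( offset limit start R: n1 )
      ?do                  ( offset  R: n1 loop-sys )
        i c@ over i + c@ - ?dup             \ c1 - c2
        if nip unloop r> drop exit then     \ discard offset and n1
      loop
      drop r> negate ;                      \ -(u2 - u1) = u1 - u2

    : test-strcmp-stack ( -- )
      s" abc" s" abd" strcmp 0< 0= abort" abc should sort before abd"
      s" abd" s" abc" strcmp 0> 0= abort" abd should sort after abc"
      s" abc" s" abc" strcmp 0<>   abort" equal strings should give 0"
      s" ab"  s" abc" strcmp 0< 0= abort" a prefix should sort first"
      depth abort" stack should be empty after the checks" ;
    test-strcmp-stack
    ```

    The DEPTH check assumes the test is run on an otherwise empty data
    stack; it is there to catch items leaked by the word under test.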

    Maybe a stack purist can improve on that. But can he improve it
    enough to make it as easy to understand as any of the versions with
    locals?

    I recently reviewed the string comparison for search-wordlist
    and came up with the following

    The string stored in the word header is already uppercased.
    So string comparison will be case insensitive

    : UC ( c -- c' ) \ uppercase char
    dup $61 $7B within $20 and - ;
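
    A short check of how UC works, assuming a standard system where a
    true flag has all bits set: WITHIN yields true for $61 <= c < $7B
    (a to z), so "$20 and" turns the flag into $20 or 0, and the
    subtraction maps a lowercase letter to its uppercase form. TEST-UC is
    a made-up name.

    ```forth
    \ UC repeated from above so that the block is self-contained.
    : UC ( c -- c' ) \ uppercase char
      dup $61 $7B within $20 and - ;

    : test-uc ( -- )
      [char] a uc [char] A <> abort" a should map to A"
      [char] z uc [char] Z <> abort" z should map to Z"
      [char] A uc [char] A <> abort" A should stay A"
      [char] 1 uc [char] 1 <> abort" digits should stay unchanged"
      $7B uc $7B <> abort" $7B is just outside the range and should stay" ;
    test-uc
    ```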


    : NCOMP4 ( addr n addr' n' -- f ) \ 0 is match
    dup >r
    begin
    rot = while \ str cstr
    r> dup 1- >r
    while \ str cstr
    swap count uc \ cstr str' s1
    rot count \ str' s1 cstr' c1
    repeat
    2drop r> drop 0 exit
    then
    2drop r> drop 1 ;

    In the first iteration, the loop does not compare chars but the lengths!
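
    To make that concrete, here is an annotated copy with a few checks.
    It is only a sketch: the comments are one reading of the code, and
    TEST-NCOMP4 is a made-up name. Only the first string goes through UC;
    the second one is assumed to be uppercase already, as in the word
    header.

    ```forth
    : UC ( c -- c' ) \ uppercase char
      dup $61 $7B within $20 and - ;

    : NCOMP4 ( addr n addr' n' -- f ) \ 0 is match
      dup >r                  \ keep n' on the return stack as a countdown
      begin
        rot = while           \ 1st pass: n = n' ?  later passes: c = c' ?
        r> dup 1- >r          \ fetch remaining count, store count-1
      while                   \ leave the loop when no chars remain
        swap count uc         \ next char of the first string, uppercased
        rot count             \ next char of the second string
      repeat
        2drop r> drop 0 exit  \ count exhausted without mismatch: match
      then
      2drop r> drop 1 ;       \ lengths or chars differed: no match

    : test-ncomp4 ( -- )
      s" dup" s" DUP"  ncomp4    abort" dup should match DUP"
      s" dup" s" DROP" ncomp4 0= abort" different lengths should not match"
      s" dux" s" DUP"  ncomp4 0= abort" different chars should not match"
      s" "    s" "     ncomp4    abort" two empty strings should match" ;
    test-ncomp4
    ```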

    BR
    Peter



    - anton


  • From Hans Bezemer@[email protected] to comp.lang.forth on Sat Apr 25 17:38:11 2026
    From Newsgroup: comp.lang.forth

    On 25-04-2026 16:07, peter wrote:

    I recently reviewed the string comparison for search-wordlist
    and came up with the following

    The string stored in the word header is already uppercased.
    So string comparison will be case insensitive

    : UC ( c -- c' ) \ uppercase char
    dup $61 $7B within $20 and - ;


    : NCOMP4 ( addr n addr' n' -- f ) \ 0 is match
    dup >r
    begin
    rot = while \ str cstr
    r> dup 1- >r
    while \ str cstr
    swap count uc \ cstr str' s1
    rot count \ str' s1 cstr' c1
    repeat
    2drop r> drop 0 exit
    then
    2drop r> drop 1 ;

    In the first iteration, the loop does not compare chars but the lengths!

    BR
    Peter

    This one is about a third bigger than yours - if we disregard the "UC",
    that is:

    : comp ( a1 n1 a2 n2 -- f ) \ false is match
    rot over - if drop 2drop true exit then
    0 ?do
    over i chars + c@ over i chars + c@ -
    if drop drop unloop true exit then
    loop drop drop false
    ;

    In 4tH, it is even visually more compact:

    : comp
    rot over - if drop 2drop true ;then
    0 ?do over i [] c@ over i [] c@ - if drop drop unloop true ;then loop
    drop drop false
    ;

    The extra length comes mainly from the three different possible exits:
    - It's not the same size (first line);
    - It's not the same content (exit within loop);
    - It's the same thing (after loop).
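
    A small sketch that exercises each of the three exits, using the
    standard Forth version from above. The stack comment and the name
    TEST-COMP are additions for this example.

    ```forth
    : comp ( a1 n1 a2 n2 -- f ) \ false is match
      rot over - if drop 2drop true exit then      \ exit 1: sizes differ
      0 ?do
        over i chars + c@ over i chars + c@ -
        if drop drop unloop true exit then         \ exit 2: contents differ
      loop drop drop false                         \ exit 3: same thing
    ;

    : test-comp ( -- )
      s" abc" s" abcd" comp 0= abort" size mismatch should give true"
      s" abc" s" abd"  comp 0= abort" content mismatch should give true"
      s" abc" s" abc"  comp    abort" identical strings should give false" ;
    test-comp
    ```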

    I can't say I particularly like the use of "COUNT" here - because it
    actually represents "C@+" - except for the first run. Neither am I very
    happy with the BEGIN..WHILE..WHILE..REPEAT..THEN construct - but that's
    not your fault ;-)
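
    For readers who have not met the idiom: COUNT fetches the char at an
    address and returns the address advanced by one char, which is exactly
    what a "C@+" would do. A sketch, where C@+ and SUM-CHARS are
    illustrative names and not standard words (a system may already define
    its own C@+, in which case this redefines it):

    ```forth
    \ Giving the idiom its own name makes the intent visible.
    : c@+ ( c-addr -- c-addr' c )  count ;

    \ Example use: add up the character codes of a string.
    : sum-chars ( c-addr u -- n )
      0 -rot 0 ?do      ( sum c-addr )
        c@+ rot + swap  \ fetch a char, advance, add the char to the sum
      loop drop ;

    : test-sum-chars ( -- )
      s" AB" sum-chars 131 <> abort" A (65) + B (66) should give 131"
      s" "   sum-chars        abort" an empty string should give 0" ;
    test-sum-chars
    ```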

    All that being said, I cannot deny it is a clever piece of code using
    the full capabilities of the language, bravo!

    Hans Bezemer