• Re: Unicode in strings

    From Stefan Monnier@[email protected] to comp.arch on Tue May 14 12:24:31 2024
    From Newsgroup: comp.arch

    Assume you're implementing a language which has a function of setting
    an individual character in a string.
    That's a design mistake in the language, and I know no language that
    has this misfeature.

    I suspect "individual character" meant "code point" above.
    Does Unicode even have the notion of "character", really?

    Instead, what we see is one language (Python3) that has an even worse misfeature: You can set an individual code point in a string; see
    above for the things you get when you overwrite code points.

    I think it's fairly common for languages that started with strings
    as "arrays of 8bit chars".

    Emacs Lisp has this misfeature as well (and so does Common Lisp). 🙁
    It's really hard to get rid of it, even though it's used *very* rarely.
    In ELisp, strings are represented internally as utf-8 (tho it pretends
    to be an array of code points), so an assignment that replaces a single
    char can require reallocating the array!

    But why would one want to set individual code points?

    Because you know your string only contains "characters" made of a single
    code point?

    E.g. your string contains the representation of the border of a table
    (to be displayed in a tty), and you want to "move" the `+` of a column separator (or a prettier version that takes advantage of the wider
    choice offered by Unicode).


    Stefan
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@[email protected] (MitchAlsup1) to comp.arch on Tue May 14 17:43:43 2024
    From Newsgroup: comp.arch

    Anton Ertl wrote:

    Thomas Koenig <[email protected]> writes:

    E.g., consider the following Gforth code (others can tell you how to
    do it in Python):

    "Ko\u0308nig" cr type

    The output is:

    König

    That is, the second character consists of two Unicode code points, the
    "o" and the "\u0308" (Combining Diaeresis).

    (I think that somewhere along the way from the Forth system to the
    xterm through copying and pasting into Emacs the second character has
    become precomposed, but that's probably just as well, so you can see
    what I see).

    If I replace the third code point with an e, I get "Koenig". So by overwriting one code point, I insert a character into the string.

    If instead I replace the second code point with a "\u0316" (Combining
    Grave Accent Below):

    "K\u0316\u0308nig" cr type

    I get this (which looks as expected in my xterm, but not in Emacs)

    K̖̈nig

    The first character is now a K with a diaeresis above and an accent
    grave below and there are now a total of 4 characters, but still 6
    code points in the string; the second character has been deleted by
    this code-point replacement.
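
    The same experiment can be sketched in Python for comparison (a rough
    equivalent only; Python's str is immutable, so each "replacement" below
    builds a new string by slicing rather than storing into the old one):

        s = "Ko\u0308nig"            # 6 code points, 5 user-perceived characters
        print(len(s), s)             # prints: 6 König

        # Replace the third code point (the combining diaeresis) with an 'e':
        print(s[:2] + "e" + s[3:])   # Koenig - overwriting one code point
                                     # inserted a character

        # Replace the second code point (the 'o') with U+0316 instead:
        t = s[:1] + "\u0316" + s[2:]
        print(len(t), t)             # 6 K̖̈nig - still 6 code points,
                                     # but only 4 characters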


    It seems to me (in my vast ignorance) that names for things should be
    written in the most appropriate set of characters in the language of
    the person/thing being named.

    Then when such a name is "sent out to be displayed" that it is a property
    of the display what character set(s) it can properly emit, and thereby
    alter the string of characters as appropriate to its capabilities.

    For example:: Take > "K\u0316\u0308nig" cr type ==> K̖̈nig
    When displayed on an ASCII-only line printer it would be written Koenig
    When displayed on an enhanced ASCII printer it would be written König
    When displayed on a fully functional printer it would be written K̖̈nig

    The problem is the mapping function between how it should be encoded
    in its own native language to what can be expressed on a particular
    device.

    Only the display device needs to understand this mapping and NOT the program/software/device holding the string.

    I think people in Japan should be able to use printf by using プリントフ.
    There is way too much "English" in the way computers are being used.
    It is similar to Anthropomorphizing animal behavior.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From David Brown@[email protected] to comp.arch on Tue May 14 20:35:37 2024
    From Newsgroup: comp.arch

    On 14/05/2024 19:43, MitchAlsup1 wrote:

    I think people in Japan should be able to use printf by using プリントフ.
    There is way too much "English" in the way computers are being used.

    I disagree entirely here.

    For many things, international consistency is more important than
    picking local-sounding names for things that have no localised meaning.
    Having a Japanese name and spelling for "printf" doesn't give Japanese
    programmers any useful information; it is not easier to type or read,
    and simply ensures that they can't cooperate and collaborate with
    programmers using different languages. MS Office uses local languages
    for its macros and formulas in Excel - I've never heard anyone in Norway
    say they like it, and many say it is a PITA that makes it hard to
    work with and hard to search for information. Most people IME who use
    macros a lot prefer to stick to English.

    It works the other way too. When discussing Karate or Judo, most practitioners the world over know what a "mawashi geri" or an "o soto
    gari" is - most consistently use the Japanese terms regardless of native languages. Most, that is, except Americans and some other English
    speakers who feel they have to use English language terms, losing a lot
    of the subtlety and nuances of the terms and being different from their international peers.

    And when people try to force localisation of terms that have no local
    words, the result is just to encourage people to move everything over to
    a single language (English).


    It is similar to Anthropomorphizing animal behavior.

    No, it is not.

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Thomas Koenig@[email protected] to comp.arch on Tue May 14 20:47:12 2024
    From Newsgroup: comp.arch

    MitchAlsup1 <[email protected]> schrieb:

    I think people in Japan should be able to use printf by using プリントフ

    I have to put up with a minor version of that - Microsoft decided to
    localize folder names ("Program files" is displayed as "Programme"
    if you use German settings, except when you access it via the
    command line), and all Excel functions are localized; depending
    if you use English or German versions, arguments are separated
    via comma or semicolon. Of course, the other way is a syntax error.

    Saving things in native Excel format is OK, but generating a CSV
    file from a program will either work or not, depending on locale
    ("," vs ";" and "." vs ".").

    This is about as annoying as it gets...
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From anton@[email protected] (Anton Ertl) to comp.arch on Sat May 18 05:29:20 2024
    From Newsgroup: comp.arch

    Stefan Monnier <[email protected]> writes:
    [Anton Ertl:]
    [Thomas Koenig:]
    Assume you're implementing a language which has a function of setting
    an individual character in a string.
    That's a design mistake in the language, and I know no language that
    has this misfeature.

    I suspect "individual character" meant "code point" above.

    I meant character, not code point, as should have become clear from
    the following. I think that Thomas Koenig meant "character", too, but
    he may have been unaware of the difference between "character" and
    "Unicode code point".

    Does Unicode even have the notion of "character", really?

    AFAIK it does not. But applications like palindrome checkers care
    about characters, not code points.
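
    A quick Python sketch of why code points are not enough here: a
    code-point-wise palindrome test already fails on a single decomposed
    character, because reversing the code points detaches the combining
    mark from its base:

        s = "o\u0308"        # the single character "ö" as two code points
        s == s[::-1]         # False: the one-character string is rejected,
                             # since the reversal is "\u0308o", not "o\u0308"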

    OTOH, most code can be implemented fine as working on strings, without
    knowing how many characters there are in the string (and it then does
    not need to know about code points, either). In other words, it can
    be implemented just as well when the strings are represented as
    strings of code units (whether UTF-8 (bytes), UTF-16 (16-bit code
    units) or UTF-32 (32-bit code units)), and then it does not help to
    convert UTF-8 to something else on input and something else to UTF-8
    on output.

    For the code that cares about characters, if it wants to work
    correctly for characters that cannot be precomposed into a single code
    point, it has to deal with characters that consist of multiple code
    points, i.e., that even in UTF-32 are variable-width. So given that
    you have to bite the variable-width bullet anyway, you can just as
    well use UTF-8.
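
    For the decomposed K̖̈nig above, a small Python sketch of the code-unit
    counts in the three encoding forms shows the point:

        s = "K\u0316\u0308nig"
        len(s.encode("utf-8"))            # 8 UTF-8 code units (bytes)
        len(s.encode("utf-16-le")) // 2   # 6 UTF-16 code units
        len(s.encode("utf-32-le")) // 4   # 6 UTF-32 code units (= code points)
        # The string has 4 user-perceived characters; no fixed-width encoding
        # form gives you that number by simple indexing.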

    Instead, what we see is one language (Python3) that has an even worse
    misfeature: You can set an individual code point in a string; see
    above for the things you get when you overwrite code points.

    I think it's fairly common for languages that started with strings
    as "arrays of 8bit chars".

    Apart from Python3, not in those languages that I have looked at more
    closely wrt this feature.

    In particular, C was created by adding a byte type to B, and that type
    was called "char". It was allowed to be wider to cater for
    word-addressed machines, but on byte-addressed machines "char" is
    invariably a byte. To cater to Unicode, they used a two-pronged
    approach: they added wchar_t and multi-byte functions (IIRC both
    already in C89); wchar_t was obviously introduced to cater for the
    upcoming Unicode 1.0 (which satisfied code unit=code point=character),
    while the multibyte stuff was probably introduced originally for
    dealing with the ASCII-compatible East-Asian encodings.

    When UTF-8 arrived, the multi-byte functions proved to fit that well;
    but of course there is not much usage of those functions, because most
    code works fine without knowing about individual code points or
    characters. And UTF-8 turned out to be the answer to dealing with
    Unicode that the Unix programmers who had a lot of code working with
    strings of chars (i.e., bytes) were looking for.

    Then Unicode 2.0 arrived and the Win32 API (which had embraced wchar_t
    and defined it as being 16-bit) stuck with 16-bit wchar_t, which
    breaks "code unit=code point"; this may not be in line with the
    intentions of the inventors of wchar_t (e.g., there are no
    multi-wchar_t functions in the C standard last time I looked), but
    that has been the existing practice in wchar_t use in C for more than
    a quarter-century.

    Unix, where wchar_t was (and still is) little used, switched to 32-bit
    wchar_t, but

    1) given that Unicode at some point (probably already in 2.0) broke
    "code point=character", that does not really help software like
    palindrome checkers.

    2) wchar_t is little-used in Unix-specific code.

    3) Code that wants to be portable between Unix and Windows and uses
    wchar_t cannot rely on "code unit=code point" anyway.

    So, in practice, C code does not make use of the ability to set an
    individual code point by overwriting a fixed-size code unit.

    Forth has chars that are 8 bits wide in traditional Forth systems on
    byte-addressed machines. The 1994 standard (in the middle of the
    reign of Unicode 1.0, and with lots of Californians on the
    standardization committee) provided the option to implement Forth
    systems with chars that take a fixed number >1 of bytes, and one
    system (JaxForth by Jack Woehr for Windows NT) implemented 16-bit
    chars.

    However, JaxForth was not very popular, and most code assumed that 1
    char = 1 (i.e., 8 bits on a byte-addressed machine), and given that
    there was no widely available system that deviated from that, even
    code that wanted to avoid this assumption could not be tested. And
    given that most code has this assumption and would not work on systems
    with 1 chars > 1, all the other systems stuck with 1 char = 1. A Chicken-and-Egg problem? Not really:

    When we looked at the problem in 2004, we found that most code works
    fine with UTF-8; that's because most code does not care about
    characters. Even code that uses words like C@ (load a char from
    memory) typically does it in a way that works with UTF-8. We proposed
    a number of words for dealing with variable-width xchars (what C calls multi-byte characters), and you can theoretically use them with the
    pre-Unicode East-Asian encodings as well as with UTF-8. These words
    were standardized in Forth-2012, but they are actually little-used
    (including by me), because most code actually works fine with opaque
    strings.

    In Gforth, an xchar is a code point, not a character, so these words
    are currently less useful for writing palindrome checkers than one
    might hope. Maybe at some point we will look at the problem again,
    and provide words for dealing with characters, Unicode normalization,
    collating order and such things, but for now the pain is not big
    enough to tackle that problem.

    Finally, I proposed to standardize the common practice 1 chars = 1;
    this proposal was accepted for standardization in 2016.

    Emacs Lisp has this misfeature as well (and so does Common Lisp). 🙁
    It's really hard to get rid of it, even though it's used *very* rarely.
    In ELisp, strings are represented internally as utf-8 (tho it pretends
    to be an array of code points), so an assignment that replaces a single
    char can require reallocating the array!

    One way forward might be to also provide a string-oriented API with
    byte (code unit) indices, and recommend that people use that instead
    of the inefficient code-point-indexed API. For a high-level language
    like Elisp or Python, the internal representation can depend on which
    function was last used on the string. So if code uses only the
    string-oriented API, you may be able to avoid the costs of the
    code-point API completely.

    But why would one want to set individual code points?

    Because you know your string only contains "characters" made of a single
    code point?

    This incorrect "knowledge" may be the reason why Emacs 27.1 displays

    K̖̈nig

    as if the first three-code-point character actually was three characters.

    E.g. your string contains the representation of the border of a table
    (to be displayed in a tty), and you want to "move" the `+` of a column separator (or a prettier version that takes advantage of the wider
    choice offered by Unicode).

    These kinds of things involve additional complications. Not only do
    you have to know the difference between code points and characters,
    you also have to know the visual width of a character, which is 0-2 for fixed-width fonts to be used in xterm or the like. Actually, if you
    treat a combining mark as having width 0, you may be able to work with
    code points and do not need characters.
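
    A rough Python sketch of that width computation, using only the
    standard unicodedata module (real terminals and fonts have further
    quirks, so treat this as an approximation):

        import unicodedata

        def display_width(s):
            # Rough column width for a fixed-width tty font: combining
            # marks take 0 columns, East Asian wide/fullwidth characters
            # take 2, everything else takes 1.
            w = 0
            for ch in s:
                if unicodedata.combining(ch):
                    continue
                w += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
            return w

        display_width("K\u0316\u0308nig")   # -> 4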

    Why do you want to move the column separator and what do you want to
    overwrite with it? This is likely the result of another operation,
    and maybe that involves another string replacement; and displaying the
    result involves so much overhead that using a string replacement
    instead of a fixed-width store is probably not the dominant cost. And
    if the replacement string happens to have as many bytes as the
    replaced string (which would happen for, e.g., replacing " " with
    "+"), the operation is not so expensive anyway.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Thomas Koenig@[email protected] to comp.arch on Sat May 18 08:29:12 2024
    From Newsgroup: comp.arch

    Anton Ertl <[email protected]> schrieb:
    Stefan Monnier <[email protected]> writes:

    Does Unicode even have the notion of "character", really?

    AFAIK it does not. But applications like palindrome checkers care
    about characters, not code points.

    Considering the huge market for palindrome checkers, that is a
    real concern, especially if they involve characters for which
    UTF-32 is not sufficient, such as smileys.

    Is there any language whose characters cannot be represented in
    UTF-32?
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From anton@[email protected] (Anton Ertl) to comp.arch on Sat May 18 08:40:40 2024
    From Newsgroup: comp.arch

    [email protected] (MitchAlsup1) writes:
    It seems to me (in my vast ignorance) that names for things should be
    written in the most appropriate set of characters in the language of
    the person/thing being named.

    Then when such a name is "sent out to be displayed" that it is a property
    of the display what character set(s) it can properly emit, and thereby
    alter the string of characters as appropriate to its capabilities.

    For example:: Take > "K\u0316\u0308nig" cr type ==> K̖̈nig
    When displayed on an ASCII-only line printer it would be written Koenig
    When displayed on an enhanced ASCII printer it would be written König
    When displayed on a fully functional printer it would be written K̖̈nig

    Why do you think that K̖̈nig should be written as Koenig or König?

    However, for König Unicode specifies that the precomposed form is
    König. And if you want a transcription into ASCII with the knowledge
    that it's German, the result would be Koenig.

    Only the display device needs to understand this mapping and NOT the program/software/device holding the string.

    Yes, that's why treating string data as opaque works for most of the
    code.

    I think people in Japan should be able to use printf by using プリントフ.
    There is way too much "English" in the way computers are being used.

    I don't know how Japanese feel about that, but I certainly don't want
    to have to use some Germanized form of C or Forth. This kind of
    catering for different natural-language programmers has been tried and
    has not taken over the world. I guess that's because

    1) You need to learn a lot about what "printf" means and how it is
    used; remembering the name is only a minor aspect.

    2) Having a name common all over the world allows you to read programs
    from all over the world, use reference material from all over the
    world, etc.

    A similar concept was implemented in COBOL, where the designers thought
    that having to write

    ADD A TO B GIVING C

    or somesuch makes programming easier than writing

    C = A+B

    in FORTRAN. Has not found many followers, either. Interestingly,
    among the Algol descendants, the BCPL (and later B and C) syntax,
    which, e.g., replaced 'or' with || or |, and was otherwise more
    symbolic and less natural-language-oriented than its ancestor Algol
    60, was the most successful syntax style among the Algol descendants,
    including spreading to languages like Java that are closer to Algol 60
    or Pascal in other respects.

    I have seen programmers define their own names based on their native
    language, however. But if they use names in their own language, these
    names should not depend on the environment.

    In the macro language of a game I play, you can refer to things
    through their name or through their numeric id. Unfortunately, the
    names are localized, so the only way to write portable macros is by
    using the unmnemonic numeric ids:-(.

    What is more common than localized programming languages is producing
    error messages in localized languages. I find this annoying, too,
    because it makes it harder to find out how others have solved the same
    problem.

    And, e.g., ENOTSUP in Unix has such a specific meaning that the
    localized text does not help the person unfamiliar with Unix, while it
    makes life harder for people who know Unix enough to make sense of the
    message; i.e., even though my native language is German, I find
    "Operation not supported" easier to understand than "Operation wird
    nicht unterstützt"; in the latter case I first have to guess what the
    English error message would have been and then I can start analysing
    the problem.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From anton@[email protected] (Anton Ertl) to comp.arch on Sat May 18 10:14:44 2024
    From Newsgroup: comp.arch

    Thomas Koenig <[email protected]> writes:
    Anton Ertl <[email protected]> schrieb:
    Stefan Monnier <[email protected]> writes:

    Does Unicode even have the notion of "character", really?

    AFAIK it does not. But applications like palindrome checkers care
    about characters, not code points.

    Considering the huge market for palindrome checkers, that is a
    real concern, especially if they involve characters for which
    UTF-32 is not sufficient, such as smileys.

    Is there any language whose characters cannot be represented in
    UTF-32?

    The goal of Unicode is to support all writing systems; AFAIK they are
    not yet finished, but they expect that these writing systems will all
    fit into the space provided by UTF-16 (i.e., a little over one million
    code points), but they found it necessary to introduce the concept of
    composing glyphs from multiple code points.

    So if your question is: "Is there any language where one character
    cannot be represented by a single Unicode code point?" The answer is
    that the Unicode designers certainly expect that there are such
    writing systems.

    And looking at <https://en.wikipedia.org/wiki/Telugu_script> (just an
    example), I see that the table of Unicode code points for Telugu <https://en.wikipedia.org/wiki/Telugu_script#Unicode> is much smaller
    than the tables of glyphs in <https://en.wikipedia.org/wiki/Telugu_script#Articulation_of_consonants>
    and <https://en.wikipedia.org/wiki/Telugu_script#Consonants_with_vowel_diacritics>, so the Telugu script seems to be one writing system that cannot be
    represented with only precomposed characters.

    I don't know if palindromes are a thing in Telugu, though.

    But, as your reference to the size of the market for palindrome
    checkers indicates, there is actually little code where dealing with
    individual characters is relevant. For code where individual
    characters are not relevant and opaque strings are sufficient, there
    is no reason to use UTF-32. And for code where individual characters
    are relevant, code points are not sufficient in general, so there is
    no reason to use UTF-32 for that, either.

    Interestingly, Emacs 27.1 manages to deal with "తెలుగు లిపి" (which
    contains 6 characters composed of a total of 11 code points) just
    fine, while it fails on König (with a decomposed Umlaut-o).

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Thomas Koenig@[email protected] to comp.arch on Sat May 18 14:09:31 2024
    From Newsgroup: comp.arch

    Anton Ertl <[email protected]> schrieb:
    [email protected] (MitchAlsup1) writes:
    It seems to me (in my vast ignorance) that names for things should be written in the most appropriate set of characters in the language of
    the person/thing being named.

    Then when such a name is "sent out to be displayed" that it is a property of the display what character set(s) it can properly emit, and thereby alter the string of characters as appropriate to its capabilities.

    For example:: Take > "K\u0316\u0308nig" cr type ==> K̖̈nig
    When displayed on an ASCII-only line printer it would be written Koenig
    When displayed on an enhanced ASCII printer it would be written König
    When displayed on a fully functional printer it would be written K̖̈nig

    Why do you think that K̖̈nig should be written as Koenig or König?

    On my display, this read K, n with a diacritic and something close to
    a cedille under the n.


    However, for König

    Again, the diaeresis is over the n, not the o.

    Unicode specifies that the precomposed form is
    König. And if you want a transcription into ASCII with the knowledge
    that it's German, the result would be Koenig.

    This is actually sometimes a (fairly minor) problem because the
    name on my passport actually reads "König" (o-diacritic), but
    people without knowledge of German tend to transcribe this as
    "Konig", whereas I transcribe it as "Koenig" on official forms
    such as the one I need to fill out prior to entering the US.

    This is why modern EU passports have a canonical form of the
    name, which then is "KOENIG".
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Terje Mathisen@[email protected] to comp.arch on Sat May 18 16:25:54 2024
    From Newsgroup: comp.arch

    Thomas Koenig wrote:
    Anton Ertl <[email protected]> schrieb:
    [email protected] (MitchAlsup1) writes:
    It seems to me (in my vast ignorance) that names for things should be written in the most appropriate set of characters in the language of
    the person/thing being named.

    Then when such a name is "sent out to be displayed" that it is a property of the display what character set(s) it can properly emit, and thereby
    alter the string of characters as appropriate to its capabilities.

    For example:: Take > "K\u0316\u0308nig" cr type ==> K̖̈nig
    When displayed on an ASCII-only line printer it would be written Koenig
    When displayed on an enhanced ASCII printer it would be written König
    When displayed on a fully functional printer it would be written K̖̈nig

    Why do you think that K̖̈nig should be written as Koenig or König?

    On my display, this read K, n with a diacritic and something close to
    a cedille under the n.


    However, for König

    Again, the diaeresis is over the n, not the o.

    Unicode specifies that the precomposed form is
    König. And if you want a transcription into ASCII with the knowledge
    that it's German, the result would be Koenig.

    This is actually sometimes a (fairly minor) problem because the
    name on my passport actually reads "König" (o-diacritic), but
    people without knowledge of German tend to transcribe this as
    "Konig", whereas I transcribe it as "Koenig" on official forms
    such as the one I need to fill out prior to entering the US.

    This is why modern EU passports have a canonical form of the
    name, which then is "KOENIG".

    Same problem as my wife and kids who have Norløff either as part of their surname or (my wife) as-is.
    Canonical simplification of the 'ø' character is either 'o' or 'oe', and passports and airline tickets differ, something which can cause all
    sorts of issues with US passport control.
    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Thomas Koenig@[email protected] to comp.arch on Sat May 18 14:41:04 2024
    From Newsgroup: comp.arch

    Terje Mathisen <[email protected]> schrieb:

    Canonical simplification of the 'ø' character is either 'o' or 'oe', and passports and airline tickets differ, something which can cause all
    sorts of issues with US passport control.

    Reminds me of either "Asterix and the Great Crossing" or "Asterix
    and the Normans", where Viking speach was indicated by having
    slashes through letters (like ø). When Obelix tries to speak
    their language, he also applies slashes, but does so randomly
    (like through a c) so nobody can understand him.

    Hmm... a challenge, can this be represented as Unicode codepoints?
    I would not be surprised if some Asterix fan had snuck it in while
    nobody was looking.

    (For those who don't know Asterix: It is a comic that was/is wildly
    popular in France and Germany at least, about Gauls who keep on
    resisting Roman occupation in the times of Julius Caesar, aided
    by a magic potion which gives them superhuman strength.)
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From anton@[email protected] (Anton Ertl) to comp.arch on Sat May 18 15:43:05 2024
    From Newsgroup: comp.arch

    Thomas Koenig <[email protected]> writes:
    Anton Ertl <[email protected]> schrieb:
    Why do you think that K̖̈nig should be written as Koenig or König?

    On my display, this read K, n with a diacritic and something close to
    a cedille under the n.

    That displays correctly then. The "close to cedille" is an accent
    grave below.

    However, for König

    Again, the diaresis is over the n, not the o.

    That's strange, in the first case your display system composes the
    diaeresis correctly with the preceding glyph (at that point, a K with
    accent grave below), but in the o case, it incorrectly composes it
    with the next glyph.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From anton@[email protected] (Anton Ertl) to comp.arch on Sat May 18 15:48:35 2024
    From Newsgroup: comp.arch

    Thomas Koenig <[email protected]> writes:
    Terje Mathisen <[email protected]> schrieb:

    Canonical simplification of the 'ø' character is either 'o' or 'oe', and passports and airline tickets differ, something which can cause all
    sorts of issues with US passport control.

    Reminds me of either "Asterix and the Great Crossing" or "Asterix
    and the Normans", where Viking speach was indicated by having
    slashes through letters (like ø). When Obelix tries to speak
    their language, he also applies slashes, but does so randomly
    (like through a c) so nobody can understand him.

    Hmm... a challenge, can this be represented as Unicode codepoints?

    Sure. See <https://en.wikipedia.org/wiki/Bar_(diacritic)>.
    Interestingly, the Obelix character ȼ you mention above has its own
    precomposed code point U+023C (Latin Small Letter C with Stroke) and
    its own Wikipedia page: https://en.wikipedia.org/wiki/%C8%BB, but you
    can also compose it from c and the combining short solidus overlay: c̷
    (this does not display correctly on Emacs 27.1, but composes correctly
    on an xterm). There is no precomposed Latin Small Letter D with
    Stroke, but you can compose it in the same way: d̷.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From John Levine@[email protected] to comp.arch on Sat May 18 17:09:44 2024
    From Newsgroup: comp.arch

    According to Thomas Koenig <[email protected]>:
    Considering the huge market for palindrome checkers, that is a
    real concern, especially if they involve characters for which
    UTF-32 is not sufficient, such as smileys.

    Is there any language whose characters cannot be represented in
    UTF-32?

    Chinese. There is a huge backlog of obscure but real Chinese characters
    that do not have a Unicode code point. This ISO committee is slowly
    working through them. Every couple of years they approve a batch of
    several thousand of them.

    https://en.wikipedia.org/wiki/Ideographic_Research_Group
    --
    Regards,
    John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Stephen Fuld@[email protected] to comp.arch on Sat May 18 17:11:32 2024
    From Newsgroup: comp.arch

    Anton Ertl wrote:


    snip


    A similar concept was implemented in COBOL, where the designers thought
    that having to write

    ADD A TO B GIVING C

    or somesuch makes programming easier than writing

    C = A+B

    in FORTRAN.


    I would put a slightly different spin on it. I believe that the
    original COBOL was designed not so much to make programming easier, but
    to make *learning* programming (for non-programmers) easier, and
    because it was supposedly "self documenting", easier for managers, etc.
    to see how the program worked. Remember, when COBOL was developed
    (late 1950s), there weren't many programmers in existence, and it was
    felt that the "mathematical" syntax of Fortran would be too unfamiliar
    to the business people who developed the new programs to solve business problems, and who were generally not mathematicians.

    Of course, they were wrong about "self documenting", and as more people
    became programmers, the advantages of concise syntax made a big
    difference.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From John Savard@[email protected] to comp.arch on Sun May 19 15:32:49 2024
    From Newsgroup: comp.arch

    On Tue, 14 May 2024 17:43:43 +0000, [email protected] (MitchAlsup1)
    wrote:

    I think people in Japan should be able to use printf by using プリントフ.
    There is way too much "English" in the way computers are being used.
    It is similar to Anthropomorphizing animal behavior.

    One could quibble.

    If Japanese people needed to enter kana from their keyboards to write
    programs, that would be awkward; there is not yet a good way to enter
    that kind of text from a keyboard.

    However, I think your point is valid. At least in some contexts.

    Remember back in the early 8-bit days of computing, and before them,
    when schools were exposing children to PDP-8 computers?

    Children were learning to program computers in BASIC.

    Obviously, here, if children in other countries used modified versions
    of BASIC that used keywords in their own natural language, it would be
    much easier for them to get started with programming than if the
    keywords were simply arbitrary strings of letters, taken from a
    foreign language of which they may not necessarily have any knowledge.

    If Algol was supposed to be an _international_ algorithmic language,
    why weren't its keywords taken from Latin or Esperanto, instead of
    English?

    Historical note: Algol was originally called IAL; remember what JOVIAL
    stood for.

    But the objections about sharing code between countries, and the fact
    that English is so widely known in technical circles, are also true.
    It is a complicated issue, made worse by the fact that nationalism and ethnocentrism are often bad things.

    John Savard
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From John Savard@[email protected] to comp.arch on Sun May 19 15:36:45 2024
    From Newsgroup: comp.arch

    On Sat, 18 May 2024 17:11:32 -0000 (UTC), "Stephen Fuld" <[email protected]d> wrote:

    and
    because it was supposedly "self documenting", easier for managers, etc.
    to see how the program worked.

    Of course, if they designed COBOL that way, why did they include a
    statement that let you re-direct GOTO statements from elsewhere in a
    program?

    I mean, that was just *asking* for dishonest programmers to direct the
    odd pennies into their bank accounts and so on.

    John Savard
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From anton@[email protected] (Anton Ertl) to comp.arch on Mon May 20 11:46:20 2024
    From Newsgroup: comp.arch

    John Savard <[email protected]d> writes:
    Remember back in the early 8-bit days of computing, and before them,
    when schools were exposing children to PDP-8 computers?

    Children were learning to program computers in BASIC.

    Obviously, here, if children in other countries used modified versions
    of BASIC that used keywords in their own natural language, it would be
    much easier for them to get started with programming than if the
    keywords were simply arbitrary strings of letters, taken from a
    foreign language of which they may not necessarily have any knowledge.

    Logo came in versions for different native languages, but looking at <https://de.wikipedia.org/wiki/Logo_(Programmiersprache)>, it shows
    English Logo examples before German Logo examples. I tried Logo on my
    C64; I don't know whether it was in English or German, but in any case
    I was not particularly impressed.

    The C64 as well as many other home computers came with BASIC, and
    BASIC was widely used, and before today I never heard or read any
    suggestion to use native-language commands in BASIC.

    I have seen some suggestions to provide native-language versions of
    Forth, but they never went anywhere (if they were serious). The main motivation here seems to have been that it's easy to do that in Forth,
    so is there a nail to which we can apply this hammer? I attend
    German-language Forth events where some of the participants are not
    good enough at English to, e.g., read articles about Forth in English,
    but none of them has Germanized his personal Forth system.

    Scratch is also designed for children and supports native-language
    switching, which eliminates one of the drawbacks of native-language
    versions.

    Like Logo, Scratch comes out of the MIT, and I wonder if the idea that programmers have problems with names that are not in their native
    language is due to their American background.

    If Algol was supposed to be an _international_ algorithmic language,
    why weren't its keywords taken from Latin or Esperanto, instead of
    English?

    Algol 60 does not standardize a program representation in characters
    (a grave mistake fixed by most later programming languages). It
    also does not standardize reserved words (aka keywords); instead, it
    has symbols that are typically written in bold in publications to
    differentiate them from identifiers written in a normal typeface.

    It is up to the compiler implementor how the programmer has to provide
    these symbols; one way is to surround each such symbol with single
    quotes (used in ICT 1900 Algol). A compiler implementor could instead
    (or in addition) support native-language representations of these
    symbols, but I am not aware that this has happened. After all, it's
    an international language, not a national language; or maybe such
    attempts were made and sunk without much notice, for the same reasons
    we have been discussing all along.

    Elliott 803 Algol uses the reserved-word approach, which means that
    programs that use, e.g., "if" as an identifier don't work, but has the
    advantage that you don't need to put that many single quotes in the
    code. This is the approach that won in later programming languages,
    but it makes it hard to introduce new reserved words in later versions
    (they may conflict with existing programs).

    As for why the Algol standard was written in English and used names
    from English rather than from Latin, that's because Algol was designed
    in 1960 when English was the lingua franca among scholars, not before
    ~1700 when Latin served that role. And Esperanto never reached that
    status.

    But concerning Latin, at the last EuroForth conference (near Rome)
    Ulrich Hoffmann gave an amusing talk where he presented a Latinized
    Forth complete with Roman numerals. Unfortunately, that talk is not
    (yet?) online.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@[email protected] (MitchAlsup1) to comp.arch on Mon May 20 17:44:48 2024
    From Newsgroup: comp.arch

    John Savard wrote:


    Historical note: Algol was originally called IAL; remember what JOVIAL
    stood for.

    Who was Joe ?? in Jovial
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Stephen Fuld@[email protected] to comp.arch on Mon May 20 19:26:39 2024
    From Newsgroup: comp.arch

    MitchAlsup1 wrote:

    John Savard wrote:


    Historical note: Algol was originally called IAL; remember what
    JOVIAL stood for.

    Who was Joe ?? in Jovial


    Just in case you weren't joking,

    Jules Own Version of the International Algorithmic Language

    Jules was Jules Schwartz

    https://en.wikipedia.org/wiki/Jules_Schwartz
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From John Savard@[email protected] to comp.arch on Wed May 22 02:16:21 2024
    From Newsgroup: comp.arch

    On Mon, 20 May 2024 19:26:39 -0000 (UTC), "Stephen Fuld" <[email protected]d> wrote:

    MitchAlsup1 wrote:

    John Savard wrote:


    Historical note: Algol was originally called IAL; remember what
    JOVIAL stood for.

    Who was Joe ?? in Jovial


    Just in case you weren't joking,

    Jules Own Version of the International Algorithmic Language

    Jules was Jules Schwartz

    https://en.wikipedia.org/wiki/Jules_Schwartz

    Not to be confused with Julius Schwartz.

    https://en.wikipedia.org/wiki/Julius_Schwartz

    John Savard
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Stefan Monnier@[email protected] to comp.arch on Wed May 22 15:38:51 2024
    From Newsgroup: comp.arch

    Assume you're implementing a language which has a function of setting
    an individual character in a string.
    That's a design mistake in the language, and I know no language that
    has this misfeature.
    I suspect "individual character" meant "code point" above.
    I meant character, not code point, as should have become clear from
    the following. I think that Thomas Koenig meant "character", too, but
    he may have been unaware of the difference between "character" and
    "Unicode code point".

    I don't know of any language (or even library) that supports the notion
    of "character" for Unicode strings. 🙁

    OTOH, most code can be implemented fine as working on strings, without knowing how many characters there are in the string (and it then does
    not need to know about code points, either).

    Indeed, most operations on strings are conversion of things to strings, concatenation of strings, search (typically for a substring or a regexp), extraction of substring where the boundaries result from an earlier
    search, and parsing (which at the bottom relies often on some sort of
    regexp or equivalent system).

    All of those work just fine on a UTF-8 sequence of bytes.
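
    In Python, for instance, all of these work directly on the encoded
    bytes, without ever decoding to code points (a sketch):

        data = "తెలుగు లిపి, König".encode("utf-8")   # opaque UTF-8 bytes

        data + b" -- " + "more König".encode("utf-8") # concatenation
        i = data.find("König".encode("utf-8"))        # search gives a byte index
        data[i:]                                       # extraction at a boundary
                                                       # found by the search
        import re
        re.split(rb",\s*", data)                       # regexp-based splitting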

    Emacs Lisp has this misfeature as well (and so does Common Lisp). 🙁
    It's really hard to get rid of it, even though it's used *very* rarely.
    In ELisp, strings are represented internally as utf-8 (tho it pretends
    to be an array of code points), so an assignment that replaces a single
    char can require reallocating the array!
    One way forward might be to also provide a string-oriented API with
    byte (code unit) indices, and recommend that people use that instead
    of the inefficient code-point-indexed API.

    I think the long term solution for ELisp will be to declare strings as basically immutable.

    Because you know your string only contains "characters" made of a single
    code point?

    This incorrect "knowledge" may be the reason why Emacs 27.1 displays

    K̖̈nig

    as if the first three-code-point character actually was three characters.

    No, the above seems like a problem in the redisplay code, and that code
    is quite aware of combining characters and stuff. You're probably
    seeing simply a missing rule to allow composition/shaping of your word.
    (the composition/shaping library operates on whole strings at a time,
    but Emacs tends to be quite conservative about the string-chunks it
    sends to that library).

    I recommend you `M-x report-emacs-bug`. The fix should be fairly simple.

    E.g. your string contains the representation of the border of a table
    (to be displayed in a tty), and you want to "move" the `+` of a column
    separator (or a prettier version that takes advantage of the wider
    choice offered by Unicode).
    These kinds of things involve additional complications.

    Very much so, indeed. It usually breaks down in many different ways
    because of the common-but-not-guaranteed assumptions.


    Stefan
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From BGB-Alt@[email protected] to comp.arch on Wed May 22 17:15:53 2024
    From Newsgroup: comp.arch

    On 5/22/2024 2:38 PM, Stefan Monnier wrote:
    Assume you're implementing a language which has a function of setting an individual character in a string.
    That's a design mistake in the language, and I know no language that
    has this misfeature.
    I suspect "individual character" meant "code point" above.
    I meant character, not code point, as should have become clear from
    the following. I think that Thomas Koenig meant "character", too, but
    he may have been unaware of the difference between "character" and
    "Unicode code point".

    I don't know of any language (or even library) that supports the notion
    of "character" for Unicode strings. 🙁


    Mostly just codepoints.

    One can take their pick between UTF-16 and UTF-32, but on average UTF-16
    uses less memory than UTF-32.

    Then there is a schism between the worlds of 16 or 32 bit wchar_t, ...

    OTOH, most code can be implemented fine as working on strings, without
    knowing how many characters there are in the string (and it then does
    not need to know about code points, either).

    Indeed, most operations on strings are conversion of things to strings, concatenation of strings, search (typically for a substring or a regexp), extraction of substring where the boundaries result from an earlier
    search, and parsing (which at the bottom relies often on some sort of
    regexp or equivalent system).

    All of those work just fine on a UTF-8 sequence of bytes.


    Sometimes it depends on context which is best.

    For general use in C (or in OS APIs), UTF-8 is a win.
    Likewise it is a sensible choice within a filesystem, or for file
    storage, ...

    Sometimes, one has languages that view everything as if it were UTF-16 codepoints. But, UTF-16 wastes memory in many cases. In these cases, the winning option here may end up being to use 8859-1 or 1252 (for string literals), or M-UTF-8 for external storage (UTF-8 encoded UTF-16
    strings, with NUL escape-coded as C0-80).
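
    The NUL escape is the same trick the JVM's "modified UTF-8" uses. A
    minimal Python sketch of just that part (the helper name is only for
    illustration, and the CESU-8-style handling of code points above
    U+FFFF that a full M-UTF-8 encoder would also need is omitted):

        def mutf8_encode(s: str) -> bytes:
            # Like UTF-8, except U+0000 becomes the overlong pair C0 80,
            # so the result never contains a raw NUL byte.
            out = bytearray()
            for ch in s:
                if ch == "\x00":
                    out += b"\xc0\x80"
                else:
                    out += ch.encode("utf-8")
            return bytes(out)

        mutf8_encode("a\x00b")   # -> b'a\xc0\x80b'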


    To some extent, my TestKern sub-project is using a hacked version of
    Unicode:
    Text is typically stored (and transmitted to/from OS APIs) using M-UTF-8; Generally, 0080..009F are interpreted as in 1252 (printable characters
    rather than extended control codes);
    0400..04FF are interpreted as "dense hexadecimal" or "inline raw data"
    rather than Arabic in certain contexts (*1), ...


    *1: Though, this shouldn't break much, as the contexts where dense
    hexadecimal or raw data would exist are likely mutually exclusive from
    those that would need Arabic (and failing this could probably use some
    extra encoding hackery).

    Well, and the extra hack that is "Double-encoded UTF-8" (used internally
    by BGBCC for u8 string literals), where parts of the space are
    repurposed (mostly to reduce codepoints needing 4-6 bytes, ...).


    Emacs Lisp has this misfeature as well (and so does Common Lisp). 🙁
    It's really hard to get rid of it, even though it's used *very* rarely.
    In ELisp, strings are represented internally as utf-8 (tho it pretends
    to be an array of code points), so an assignment that replaces a single
    char can require reallocating the array!
    One way forward might be to also provide a string-oriented API with
    byte (code unit) indices, and recommend that people use that instead
    of the inefficient code-point-indexed API.

    I think the long term solution for ELisp will be to declare strings as basically immutable.


    In general, it makes sense to regard strings as immutable. My own
    language designs had generally assumed immutable strings (if one wants a mutable string, they use an array of a character type).

    Because you know your string only contains "characters" made of a single code point?

    This incorrect "knowledge" may be the reason why Emacs 27.1 displays

    K̖̈nig

    as if the first three-code-point character actually was three characters.

    No, the above seems like a problem in the redisplay code, and that code
    is quite aware of combining characters and stuff. You're probably
    seeing simply a missing rule to allow composition/shaping of your word.
    (the composition/shaping library operates on whole strings at a time,
    but Emacs tends to be quite conservative about the string-chunks it
    sends to that library).

    I recommend you `M-x report-emacs-bug`. The fix should be fairly simple.

    E.g. your string contains the representation of the border of a table
    (to be displayed in a tty), and you want to "move" the `+` of a column
    separator (or a prettier version that takes advantage of the wider
    choice offered by Unicode).
    These kinds of things involve additional complications.

    Very much so, indeed. It usually breaks down in many different ways
    because of the common-but-not-guaranteed assumptions.


    Stefan

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From anton@[email protected] (Anton Ertl) to comp.arch on Sat May 25 15:48:07 2024
    From Newsgroup: comp.arch

    Stefan Monnier <[email protected]> writes:
    [Anton Ertl:]
    I meant character, not code point, as should have become clear from
    the following. I think that Thomas Koenig meant "character", too, but
    he may have been unaware of the difference between "character" and
    "Unicode code point".

    I don't know of any language (or even library) that supports the notion
    of "character" for Unicode strings.

    My experiments with Telugu suggest that Emacs understands the concept
    of a character at least for the Telugu script (in contrast to
    decomposed Umlauts). If I press a cursor key in Telugu text, Emacs
    advances to the next character, not the next code point. However, if
    I press DEL or BS, it deletes a code point.

    Here's some text again for playing around with it:

    తెలుగు లిపి

    Anyway, the Emacs Lisp functions right-char (and, after testing, also left-char, forward-char, and backward-char) support the notion of
    character at least for some scripts. That may be the result of an
    interaction with the redisplay code that you mention later, but in
    that case it's that code that knows about characters in Unicode.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Stephen Fuld@[email protected] to comp.arch on Sun May 26 03:50:46 2024
    From Newsgroup: comp.arch

    John Savard wrote:

    On Sat, 18 May 2024 17:11:32 -0000 (UTC), "Stephen Fuld" <[email protected]d> wrote:

    and
    because it was supposedly "self documenting", easier for managers,
    etc. to see how the program worked.

    Of course, if they designed COBOL that way, why did they include a
    statement that let you re-direct GOTO statements from elsewhere in a
    program?

    That feature (Alter GOTO) was also in Fortran, as the long since
    deprecated assigned GOTO statement. I believe they were there to
    support some older computers that didn't have indexed jump/branch
    instructions, so achieved the effect by modifying the branch
    destination in the instruction itself. And yes, it was ugly and made
    comprehension of the program, and also debugging it, much harder.


    I mean, that was just asking for dishonest programmers to direct the
    odd pennies into their bank accounts and so on.

    Not really. You had to Alter the goto statement to some pre-existing
    label, not just anywhere in the code.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Thomas Koenig@[email protected] to comp.arch on Sun May 26 08:33:50 2024
    From Newsgroup: comp.arch

    Stephen Fuld <[email protected]d> schrieb:
    John Savard wrote:

    On Sat, 18 May 2024 17:11:32 -0000 (UTC), "Stephen Fuld"
    <[email protected]d> wrote:

    and
    because it was supposedly "self documenting", easier for managers,
    etc. to see how the program worked.

    Of course, if they designed COBOL that way, why did they include a
    statement that let you re-direct GOTO statements from elsewhere in a
    program?

    That feature (Alter GOTO) was also in Fortran, as the, long since
    deprecated, assigned GOTO statement.

    Assigned GOTO is

    ASSIGN 10 to N

    GOTO N (10, 20, 30, 40)

    10 CONTINUE

    which I don't think is what John S. is describing.

    What old FORTRAN compilers had was, for debugging, an AT statement,
    which sucked control from the statement into a DEBUG section, without visibility at the place where it came from. The proverbial COME FROM statement, used as a debugging aid; in the DEBUG section, variables
    could be printed _or changed_.

    Rumor has it that the AD statement was regularly abused, so there
    were a lot of programs which did not run correctly unless debugging
    was enabled...
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Thomas Koenig@[email protected] to comp.arch on Sun May 26 10:16:27 2024
    From Newsgroup: comp.arch

    Thomas Koenig <[email protected]> schrieb:

    Rumor has it that the AD statement was regularly abused,

    s/AD/AT
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Lawrence D'Oliveiro@[email protected] to comp.arch on Mon May 27 07:34:48 2024
    From Newsgroup: comp.arch

    On Sat, 18 May 2024 05:29:20 GMT, Anton Ertl wrote:

    Stefan Monnier <[email protected]> writes:

    Does Unicode even has the notion of "character", really?

    AFAIK it does not.

    It uses terms like “grapheme” and “text element” for the concept, leaving
    “character” without a fixed meaning.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Lawrence D'Oliveiro@[email protected] to comp.arch on Mon May 27 07:40:42 2024
    From Newsgroup: comp.arch

    On Wed, 22 May 2024 15:38:51 -0400, Stefan Monnier wrote:

    I don't know of any language (or even library) that supports the notion
    of "character" for Unicode strings. 🙁

    Surely a “character” (or “grapheme” I think is (one of) the Unicode terms)
    is (represented by) a non-combining code point combined with all the immediately-following combining code points.
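
    That naive rule is easy to sketch in Python with the standard
    unicodedata module, and it handles the decomposed examples from earlier
    in the thread; as the follow-ups note, though, ZWJ sequences (and the
    full UAX #29 rules) make real grapheme segmentation more involved:

        import unicodedata

        def naive_clusters(s):
            # Group each non-combining code point with the combining code
            # points that immediately follow it.
            clusters = []
            for ch in s:
                if clusters and unicodedata.combining(ch):
                    clusters[-1] += ch
                else:
                    clusters.append(ch)
            return clusters

        naive_clusters("K\u0316\u0308nig")   # -> ['K̖̈', 'n', 'i', 'g']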
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Lawrence D'Oliveiro@[email protected] to comp.arch on Mon May 27 07:42:32 2024
    From Newsgroup: comp.arch

    On Mon, 20 May 2024 11:46:20 GMT, Anton Ertl wrote:

    Algol 60 does not standardize a program representation in characters (a
    grave mistake fixed by most later programming languages ...

    That would likely not have been considered feasible in 1960, given the
    wide variation in character sets between computer systems. Even I/O was considered to be in the too-hard basket back then.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Lawrence D'Oliveiro@[email protected] to comp.arch on Mon May 27 07:43:42 2024
    From Newsgroup: comp.arch

    On Mon, 20 May 2024 17:44:48 +0000, MitchAlsup1 wrote:

    John Savard wrote:

    Historical note: Algol was originally called IAL; remember what JOVIAL
    stood for.

    Who was Joe ?? in Jovial

    Jules Schwartz <http://bitsavers.trailing-edge.com/pdf/sdc/jovial/Schwartz_-_The_Development_of_JOVIAL_1978.pdf>
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Lawrence D'Oliveiro@[email protected] to comp.arch on Mon May 27 07:45:59 2024
    From Newsgroup: comp.arch

    On Sun, 19 May 2024 15:32:49 -0600, John Savard wrote:

    If Algol was supposed to be an _international_ algorithmic language,
    why weren't its keywords taken from Latin or Esperanto, instead of
    English?

    Much of its syntax came from mathematics, which is international.

    Semi-related question: are there non-English equivalents for mathematical operators like “grad”, “div” and “curl”?
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From John Levine@[email protected] to comp.arch on Mon May 27 15:16:13 2024
    From Newsgroup: comp.arch

    According to Lawrence D'Oliveiro <[email protected]d>:
    On Wed, 22 May 2024 15:38:51 -0400, Stefan Monnier wrote:

    I don't know of any language (or even library) that supports the notion
    of "character" for Unicode strings. 🙁

    Surely a “character” (or “grapheme” I think is (one of) the Unicode terms)
    is (represented by) a non-combining code point combined with all the immediately-following combining code points.

    Take another look at the table I referred to yesterday. When you have
    ZWJ the rules of what combines with what gets awfully complicated.
    --
    Regards,
    John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Lawrence D'Oliveiro@[email protected] to comp.arch on Tue May 28 01:08:06 2024
    From Newsgroup: comp.arch

    On Mon, 27 May 2024 15:16:13 -0000 (UTC), John Levine wrote:

    According to Lawrence D'Oliveiro <[email protected]d>:
    On Wed, 22 May 2024 15:38:51 -0400, Stefan Monnier wrote:

    I don't know of any language (or even library) that supports the
    notion of "character" for Unicode strings. 🙁

    Surely a “character” (or “grapheme” I think is (one of) the Unicode terms) is (represented by) a non-combining code point combined with all
    the immediately-following combining code points.

    Take another look at the table I referred to yesterday. When you have
    ZWJ the rules of what combines with what gets awfully complicated.

    ZWJ is classed as “punctuation”, and has no combining class. So it forms a “character” or “grapheme” in its own right.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From John Levine@[email protected] to comp.arch on Tue May 28 01:25:38 2024
    From Newsgroup: comp.arch

    According to Lawrence D'Oliveiro <[email protected]d>:
    On Mon, 27 May 2024 15:16:13 -0000 (UTC), John Levine wrote:

    According to Lawrence D'Oliveiro <[email protected]d>:
    On Wed, 22 May 2024 15:38:51 -0400, Stefan Monnier wrote:

    I don't know of any language (or even library) that supports the
    notion of "character" for Unicode strings. 🙁

    Surely a “character” (or “grapheme” I think is (one of) the Unicode terms) is (represented by) a non-combining code point combined with all
    the immediately-following combining code points.

    Take another look at the table I referred to yesterday. When you have
    ZWJ the rules of what combines with what gets awfully complicated.

    ZWJ is classed as “punctuation”, and has no combining class. So it forms a
    “character” or “grapheme” in its own right.

    Really, you need to look at that combined emoji table I told you about yesterday.
    --
    Regards,
    John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Lawrence D'Oliveiro@[email protected] to comp.arch on Tue May 28 01:29:31 2024
    From Newsgroup: comp.arch

    On Tue, 28 May 2024 01:25:38 -0000 (UTC), John Levine wrote:

    Really, you need to look at that combined emoji table I told you about yesterday.

    I’m just telling you what the official Unicode spec says.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From John Levine@[email protected] to comp.arch on Tue May 28 01:36:22 2024
    From Newsgroup: comp.arch

    It appears that Lawrence D'Oliveiro <[email protected]d> said:
    On Tue, 28 May 2024 01:25:38 -0000 (UTC), John Levine wrote:

    Really, you need to look at that combined emoji table I told you about
    yesterday.

    I’m just telling you what the official Unicode spec says.

    Um, so am I. Those nine code point things are supposed to display
    as a single little picture, regardless of what some other bit of
    the spec may assert about ZWJ.
    --
    Regards,
    John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Thomas Koenig@[email protected] to comp.arch on Tue May 28 17:04:20 2024
    From Newsgroup: comp.arch

    Lawrence D'Oliveiro <[email protected]d> schrieb:
    On Sun, 19 May 2024 15:32:49 -0600, John Savard wrote:

    If Algol was supposed to be an _international_ algorithmic language,
    why weren't its keywords taken from Latin or Esperanto, instead of
    English?

    Much of its syntax came from mathematics, which is international.

    Semi-related question: are there non-English equivalents for mathematical operators like “grad”, “div” and “curl”?

    German has "grad", "div" and "rot". People also use the nabla
    operator, which I personally don't like.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Stefan Monnier@[email protected] to comp.arch on Tue May 28 16:37:22 2024
    From Newsgroup: comp.arch

    Anyway, the Emacs Lisp functions right-char (and, after testing, also left-char, forward-char, and backward-char) support the notion of
    character at least for some scripts. That may be the result of an interaction with the redisplay code that you mention later, but in
    that case it's that code that knows about characters in Unicode.

    Indeed, the concept is somewhat visible, but it's not really exposed in
    the language. I think what you're seeing is implemented elsewhere than
    in `forward-char`, it's a part of the interactive loop which sees that
    after `forward-char` you end up "in the middle" of a composition and it
    moves the point further, based on information that mostly belongs to the redisplay code.

    Try `C-u 2 C-f` and I suspect you'll see that it doesn't always advance
    by 2 characters but rather it advances by "2 code points + rounding up
    to the next character boundary".
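    A rough Python sketch of that "round up" behaviour (the grouping rule is the naive combining-class one discussed earlier in the thread, not what Emacs actually does internally):

        import unicodedata

        def forward_rounded(s, pos, n):
            # Advance n code points, then skip combining marks so we
            # never stop in the middle of a base+marks group.
            pos = min(pos + n, len(s))
            while pos < len(s) and unicodedata.combining(s[pos]) != 0:
                pos += 1
            return pos

        s = "Ko\u0308nig"                # 6 code points, 5 "characters"
        print(forward_rounded(s, 0, 2))  # 3, not 2: position 2 was the U+0308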


    Stefan
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Stefan Monnier@[email protected] to comp.arch on Tue May 28 16:53:14 2024
    From Newsgroup: comp.arch

    Um, so am I. Those nine code point things are supposed to display
    as a single little picture, regardless of what some other bit of
    the spec may assert about ZWJ.

    Maybe it's a good time to start taking bets for which will be the year
    that Unicode becomes Turing complete?


    Stefan
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From anton@[email protected] (Anton Ertl) to comp.arch on Wed May 29 06:59:55 2024
    From Newsgroup: comp.arch

    Stefan Monnier <[email protected]> writes:
    Anyway, the Emacs Lisp functions right-char (and, after testing, also
    left-char, forward-char, and backward-char) support the notion of
    character at least for some scripts. That may be the result of an
    interaction with the redisplay code that you mention later, but in
    that case it's that code that knows about characters in Unicode.

    Indeed, the concept is somewhat visible, but it's not really exposed in
    the language. I think what you're seeing is implemented elsewhere than
    in `forward-char`, it's a part of the interactive loop which sees that
    after `forward-char` you end up "in the middle" of a composition and it
    moves the point further, based on information that mostly belongs to the redisplay code.

    Try `C-u 2 C-f` and I suspect you'll see that it doesn't always advance
    by 2 characters but rather it advances by "2 code points + rounding up
    to the next character boundary".

    Confirmed. So Emacs Lisp has a codepoint-oriented interface and then
    needs to compensate for that elsewhere. This does not indicate that a codepoint-oriented interface is a good idea, rather the opposite.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From anton@[email protected] (Anton Ertl) to comp.arch on Wed May 29 08:07:50 2024
    From Newsgroup: comp.arch

    Lawrence D'Oliveiro <[email protected]d> writes:
    On Mon, 20 May 2024 11:46:20 GMT, Anton Ertl wrote:

    Algol 60 does not standardize a program representation in characters (a
    grave mistake fixed by most later programming languages ...

    That would likely not have been considered feasible in 1960, given the
    wide variation in character sets between computer systems.

    COBOL did it. LISP did it. It was feasible in 1960. It's just that
    the Algol 60 committee did not want to go there. And the Algol 68
    committee did not want to go there even though ASCII was standardized
    in 1963, and Algol 68 was only finished in 1974 AFAIK.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Stefan Monnier@[email protected] to comp.arch on Wed May 29 10:44:21 2024
    From Newsgroup: comp.arch

    Confirmed. So Emacs Lisp has a codepoint-oriented interface and then
    needs to compensate for that elsewhere. This does not indicate that a codepoint-oriented interface is a good idea, rather the opposite.

    Note that the "round to the next character boundary" is actually
    generalized to non-Unicode concepts: you can mark a chunk of text as
    being "intangible" or make it invisible and the "round up" will
    correspondingly move to the next boundary to avoid the cursor being in
    the middle of an invisible or intangible chunk of text.

    I'm not sure the codepoint-oriented API is the best option, but it's not completely clear what *is* the best option. You mention a byte-oriented
    API and you might be right that it's a better option, but in the case of
    Emacs that's what we used in Emacs-20.1 but it worked really poorly
    because of backward compatibility issues. I think if we started from
    scratch now (i.e. without having to contend with backward compatibility,
    and with a better understanding of Unicode (which barely existed back
    then)) it might work better, indeed, but that's not been an option 🙁


    Stefan
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Lawrence D'Oliveiro@[email protected] to comp.arch on Thu May 30 02:50:33 2024
    From Newsgroup: comp.arch

    On Wed, 29 May 2024 08:07:50 GMT, Anton Ertl wrote:

    Lawrence D'Oliveiro <[email protected]d> writes:

    On Mon, 20 May 2024 11:46:20 GMT, Anton Ertl wrote:

    Algol 60 does not standardize a program representation in characters
    (a grave mistake fixed by most later programming languages ...

    That would likely not have been considered feasible in 1960, given the
    wide variation in character sets between computer systems.

    COBOL did it. LISP did it.

    And so did Fortran. They all did it by severely curtailing their allowed character sets.

    It's just that the Algol 60 committee did not want to go there.

    They wanted symbols like “÷”, “×”, “↑”, “≤”, “≥”, “≠”, “≡”, “⊃”, “∨”, “∧”,
    “¬” ... you get the idea. I don’t think any computer system on earth could provide all those symbols at the time, or even, say, 20 years later.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Stephen Fuld@[email protected] to comp.arch on Thu May 30 03:21:13 2024
    From Newsgroup: comp.arch

    Lawrence D'Oliveiro wrote:


    snip

    They wanted symbols like “÷”, “×”, “↑”, “≤”, “≥”, “≠”, “≡”, “⊃”, “∨”,
    “∧”, “¬” ... you get the idea. I don’t think any computer system on earth
    could provide all those symbols at the time, or even, say, 20 years
    later.

    See APL. So many symbols that the language is almost impossible to
    read without a significant investment in learning them.

    https://en.wikipedia.org/wiki/APL_syntax_and_symbols#Monadic_functions


    Please note that I am not advocating this. It is at the opposite end
    of the spectrum from COBOL where you could get by with no special
    characters beyond periods. Neither was a good choice.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Tim Rentsch@[email protected] to comp.arch on Wed May 29 21:47:52 2024
    From Newsgroup: comp.arch

    "Stephen Fuld" <[email protected]d> writes:

    Lawrence D'Oliveiro wrote:


    snip

    They wanted symbols like [...]

    See APL. So many symbols that the language is almost impossible to
    read without a significant investment in learning them.

    https://en.wikipedia.org/wiki/APL_syntax_and_symbols#Monadic_functions

    The problem with learning APL is not the character set. APL without
    any special characters (which I actually have some experience using)
    is still unlike any other programming language that existed in the
    1960s or 1970s.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Stephen Fuld@[email protected] to comp.arch on Thu May 30 06:12:11 2024
    From Newsgroup: comp.arch

    Tim Rentsch wrote:

    "Stephen Fuld" <[email protected]d> writes:

    Lawrence D'Oliveiro wrote:


    snip

    They wanted symbols like [...]

    See APL. So many symbols that the language is almost impossible to
    read without a significant investment in learning them.


    https://en.wikipedia.org/wiki/APL_syntax_and_symbols#Monadic_functions

    The problem with learning APL is not the character set. APL without
    any special characters (which I actually have some experience using)
    is still unlike any other programming language that existed in the
    1960s or 1970s.

    OK, but my main point was to show, by counter example, the error of
    Lawrence's statement quoted below


    I don’t think any computer system on earth could
    provide all those symbols at the time, or even, say, 20 years later.

    If the part about the difficulty of learning APL was wrong, then I
    apologise.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Tim Rentsch@[email protected] to comp.arch on Thu May 30 05:38:00 2024
    From Newsgroup: comp.arch

    "Stephen Fuld" <[email protected]d> writes:

    Tim Rentsch wrote:

    "Stephen Fuld" <[email protected]d> writes:

    Lawrence D'Oliveiro wrote:

    snip

    They wanted symbols like [...]

    See APL. So many symbols that the language is almost impossible to
    read without a significant investment in learning them.

    https://en.wikipedia.org/wiki/APL_syntax_and_symbols#Monadic_functions

    The problem with learning APL is not the character set. APL without
    any special characters (which I actually have some experience using)
    is still unlike any other programming language that existed in the
    1960s or 1970s.

    OK, but my main point was to show, by counter example, the error of Lawrence's statement quoted below

    I see. I misunderstood the point of what you were saying. Sorry
    about that.

    I don't think any computer system on earth could provide all those
    symbols at the time, or even, say, 20 years later.

    If the part about the difficulty of learning APL was wrong, then I
    apologise.

    No apology needed. Even if the APL character set wasn't the main
    source of the difficulty, there is no question that the unusual
    choice of operator characters used contributed to the effort needed
    to understand and use APL.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From anton@[email protected] (Anton Ertl) to comp.arch on Thu May 30 16:25:46 2024
    From Newsgroup: comp.arch

    Stefan Monnier <[email protected]> writes:
    I'm not sure the codepoint-oriented API is the best option, but it's not completely clear what *is* the best option. You mention a byte-oriented
    API and you might be right that it's a better option, but in the case of Emacs that's what we used in Emacs-20.1 but it worked really poorly
    because of backward compatibility issues. I think if we started from
    scratch now (i.e. without having to contend with backward compatibility,
    and with a better understanding of Unicode (which barely existed back
    then)) it might work better, indeed, but that's not been an option

    Plus, editors are among the very few uses where you have to deal with individual characters, so the "treat it as opaque string" approach
    that works so well for most other code is not good enough there. The command-line editor of Gforth is one case where we use the xchar words
    (those for dealing with code points of UTF-8).
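    As a rough illustration of what such code-point-level words have to do over UTF-8, here is a Python sketch of stepping over one encoded code point (the function name is made up; it assumes well-formed UTF-8 and that the index sits on a lead byte):

        def xchar_len(lead_byte):
            # Length in bytes of one UTF-8 sequence, judged from its lead byte.
            if lead_byte < 0x80:
                return 1   # 0xxxxxxx: plain ASCII
            if lead_byte >= 0xF0:
                return 4   # 11110xxx
            if lead_byte >= 0xE0:
                return 3   # 1110xxxx
            return 2       # 110xxxxx

        buf = "König".encode("utf-8")
        i = 0
        while i < len(buf):
            j = i + xchar_len(buf[i])
            print(buf[i:j].decode("utf-8"))
            i = j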

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Stefan Monnier@[email protected] to comp.arch on Thu May 30 14:01:53 2024
    From Newsgroup: comp.arch

    The problem with learning APL is not the character set. APL without
    any special characters (which I actually have some experience using)
    is still unlike any other programming language that existed in the
    1960s or 1970s.

    There have been a few languages that took similar approaches, but the
    most recent and successful I've heard of is [jq](https://en.wikipedia.org/wiki/Jq_%28programming_language%29).


    Stefan
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From John Savard@[email protected] to comp.arch on Thu May 30 22:19:14 2024
    From Newsgroup: comp.arch

    On Wed, 29 May 2024 08:07:50 GMT, [email protected]
    (Anton Ertl) wrote:

    Lawrence D'Oliveiro <[email protected]d> writes:
    On Mon, 20 May 2024 11:46:20 GMT, Anton Ertl wrote:

    Algol 60 does not standardize a program representation in characters (a
    grave mistake fixed by most later programming languages ...

    That would likely not have been considered feasible in 1960, given the wide variation in character sets between computer systems.

    COBOL did it. LISP did it. It was feasible in 1960. It's just that
    the Algol 60 committee did not want to go there.

    There was a famous article by Bob Bemer in 1960 in the Communications
    of the ACM in which he gave a table of all this variation in character
    sets between computers. This helped spur the adoption of ASCII.

    Algol 60 was intended as an International Algorithmic Language. In
    fact, that's what Algol was first called, hence JOVIAL. So it is _not_ particularly hard for me to believe that the international committee
    behind Algol 60 wished to support a wider variety of computers than
    the people behind COBOL and LISP. Yes, those languages, unlike
    FORTRAN, weren't the creations of a single manufacturer.

    But they _were_ fairly U.S. - centric, and Algol was *not*. For
    example, there were British computer systems that offered Algol
    compilers that based their character sets on modified 5-unit
    teleprinters.

    John Savard
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From John Savard@[email protected] to comp.arch on Thu May 30 22:22:34 2024
    From Newsgroup: comp.arch

    On Thu, 30 May 2024 02:50:33 -0000 (UTC), Lawrence D'Oliveiro
    <[email protected]d> wrote:

    And so did Fortran. They all did it by severely curtailing their allowed character sets.

    It's just that the Algol 60 committee did not want to go there.

    They wanted symbols like “÷”, “×”, “↑”, “≤”, “≥”, “≠”, “≡”, “⊃”, “∨”, “∧”, “¬” ... you get the idea. I don’t think any computer system on earth could
    provide all those symbols at the time, or even, say, 20 years later.

    Well, the 120 character chain for the STRETCH computer's printer
    handled Algol's character set. And so did the punched card code for a
    couple of Russian computers. So the attempt was made.

    And then there was the LISP machine, which started life with the
    infamous "Space Cadet" computer.

    Today, of course, we have Unicode, but that doesn't mean the entire
    Algol character set is conveniently accessible directly from the
    keyboard.

    John Savard
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From John Savard@[email protected] to comp.arch on Thu May 30 22:25:47 2024
    From Newsgroup: comp.arch

    On Thu, 30 May 2024 06:12:11 -0000 (UTC), "Stephen Fuld" <[email protected]d> wrote:

    If the part about the difficulty of learning APL was wrong, then I
    apologise.

    I would not say that it was wrong. APL "without special characters"
    was achieved by way of a transliteration scheme, where short codes
    represented the special characters. So instead of memorizing funny
    shapes, you memorized cryptic abbreviations.

    So the character set was _still_ the source of the difficulty of
    learning APL even if you happened to be using an implementation that
    didn't have any special characters.

    John Savard
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Michael S@[email protected] to comp.arch on Fri May 31 12:59:42 2024
    From Newsgroup: comp.arch

    On Thu, 30 May 2024 22:19:14 -0600
    John Savard <[email protected]d> wrote:

    But they _were_ fairly U.S. - centric, and Algol was *not*. For
    example,

    U.S.-centric vs U.S. eccentric. http://www.cs.yale.edu/homes/perlis-alan/quotes.html

    Actually I am pretty sure that "eccentric" is not a fair
    characterisation of his personality, but can't resist.

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Tim Rentsch@[email protected] to comp.arch on Fri May 31 09:47:58 2024
    From Newsgroup: comp.arch

    John Savard <[email protected]d> writes:

    On Thu, 30 May 2024 06:12:11 -0000 (UTC), "Stephen Fuld" <[email protected]d> wrote:

    If the part about the difficulty of learning APL was wrong, then I
    apologise.

    I would not say that it was wrong. APL "without special characters"
    was achieved by way of a transliteration scheme, where short codes represented the special characters. So instead of memorizing funny
    shapes, you memorized cryptic abbreviations.

    So the character set was _still_ the source of the difficulty of
    learning APL even if you happened to be using an implementation that
    didn't have any special characters.

    The character set was a source of some of the difficulty of
    learning APL. Certainly not all of it.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From BGB@[email protected] to comp.arch on Fri May 31 12:14:19 2024
    From Newsgroup: comp.arch

    On 5/30/2024 11:25 AM, Anton Ertl wrote:
    Stefan Monnier <[email protected]> writes:
    I'm not sure the codepoint-oriented API is the best option, but it's not
    completely clear what *is* the best option. You mention a byte-oriented
    API and you might be right that it's a better option, but in the case of
    Emacs that's what we used in Emacs-20.1 but it worked really poorly
    because of backward compatibility issues. I think if we started from
    scratch now (i.e. without having to contend with backward compatibility,
    and with a better understanding of Unicode (which barely existed back
    then)) it might work better, indeed, but that's not been an option

    Plus, editors are among the very few uses where you have to deal with individual characters, so the "treat it as opaque string" approach
    that works so well for most other code is not good enough there. The command-line editor of Gforth is one case where we use the xchar words
    (those for dealing with code points of UTF-8).


    Yeah.

    For text editors, this is one of the few cases it makes sense to use 32
    or 64 bit characters (say, combining the 'character' with some
    additional metadata such as formatting).

    Though, one thing that makes sense for text editors is if only the
    "currently being edited" lines are fully unpacked, whereas the others
    can remain in a more compact form (such as UTF-8), and are then unpacked
    as they come into view (say, treating the editor window as a 32-entry
    modulo cache or similar).

    For the rest, say, one can have, say, a big buffer, with an array of
    lines giving the location and size of the line's text in the buffer.

    If a line is modified, it can be reallocated at the end of the buffer,
    and if the buffer gets full, it can be "repacked" and/or expanded as
    needed. When written back to a file, the buffer lines can be emitted
    in-order to the text file.
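    A very rough Python sketch of that scheme (a hypothetical class, purely illustrative; real editors more often use gap buffers, ropes, or piece tables):

        class LineBuffer:
            # Lines are (offset, length) pairs into one big byte buffer.
            # A modified line is re-appended at the end; the old bytes
            # become garbage until the next repack.
            def __init__(self, text):
                # text: a bytes object holding the whole file
                self.buf = bytearray()
                self.lines = []
                for line in text.split(b"\n"):
                    self.lines.append((len(self.buf), len(line)))
                    self.buf += line

            def get(self, i):
                off, ln = self.lines[i]
                return bytes(self.buf[off:off + ln])

            def set(self, i, new_line):
                # Reallocate the modified line at the end of the buffer.
                self.lines[i] = (len(self.buf), len(new_line))
                self.buf += new_line

            def repack(self):
                # Copy only the live lines into a fresh buffer.
                packed, lines = bytearray(), []
                for off, ln in self.lines:
                    lines.append((len(packed), ln))
                    packed += self.buf[off:off + ln]
                self.buf, self.lines = packed, lines

            def save(self):
                # Emit the lines in order when writing back to a file.
                return b"\n".join(self.get(i) for i in range(len(self.lines)))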

    Not entirely sure how other text editors manage things here, not really
    looked into it.


    - anton

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@[email protected] (MitchAlsup1) to comp.arch on Fri May 31 17:21:53 2024
    From Newsgroup: comp.arch

    BGB wrote:

    On 5/30/2024 11:25 AM, Anton Ertl wrote:
    Stefan Monnier <[email protected]> writes:
    I'm not sure the codepoint-oriented API is the best option, but it's
    not
    completely clear what *is* the best option. You mention a
    byte-oriented
    API and you might be right that it's a better option, but in the case
    of
    Emacs that's what we used in Emacs-20.1 but it worked really poorly
    because of backward compatibility issues. I think if we started from
    scratch now (i.e. without having to contend with backward
    compatibility,
    and with a better understanding of Unicode (which barely existed back
    then)) it might work better, indeed, but that's not been an option

    Plus, editors are among the very few uses where you have to deal with
    individual characters, so the "treat it as opaque string" approach
    that works so well for most other code is not good enough there. The
    command-line editor of Gforth is one case where we use the xchar words
    (those for dealing with code points of UTF-8).


    Yeah.

    For text editors, this is one of the few cases it makes sense to use 32

    or 64 bit characters (say, combining the 'character' with some
    additional metadata such as formatting).

    Though, one thing that makes sense for text editors is if only the "currently being edited" lines are fully unpacked, whereas the others
    can remain in a more compact form (such as UTF-8), and are then
    unpacked

    as they come into view (say, treating the editor window as a 32-entry
    modulo cache or similar).

    For the rest, say, one can have, say, a big buffer, with an array of
    lines giving the location and size of the line's text in the buffer.

    In a modern text editor, one can paste in {*.xls tables, *.jpg, *.gif,
    ..}
    along with text from different fonts and different backgrounds on a per character basis.

    If a line is modified, it can be reallocated at the end of the buffer,
    and if the buffer gets full, it can be "repacked" and/or expanded as
    needed. When written back to a file, the buffer lines can be emitted in-order to the text file.

    Not entirely sure how other text editors manage things here, not really

    looked into it.

    If you think about it with the above features, you quickly realize it
    is not just text anymore.


    - anton
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From BGB@[email protected] to comp.arch on Fri May 31 12:55:59 2024
    From Newsgroup: comp.arch

    On 5/31/2024 12:21 PM, MitchAlsup1 wrote:
    BGB wrote:

    On 5/30/2024 11:25 AM, Anton Ertl wrote:
    Stefan Monnier <[email protected]> writes:
    I'm not sure the codepoint-oriented API is the best option, but it's
    not
    completely clear what *is* the best option.  You mention a
    byte-oriented
    API and you might be right that it's a better option, but in the case
    of
    Emacs that's what we used in Emacs-20.1 but it worked really poorly
    because of backward compatibility issues.  I think if we started from scratch now (i.e. without having to contend with backward
    compatibility,
    and with a better understanding of Unicode (which barely existed back
    then)) it might work better, indeed, but that's not been an option

    Plus, editors are among the very few uses where you have to deal with
    individual characters, so the "treat it as opaque string" approach
    that works so well for most other code is not good enough there.  The
    command-line editor of Gforth is one case where we use the xchar words
    (those for dealing with code points of UTF-8).


    Yeah.

    For text editors, this is one of the few cases it makes sense to use 32

    or 64 bit characters (say, combining the 'character' with some
    additional metadata such as formatting).

    Though, one thing that makes sense for text editors is if only the
    "currently being edited" lines are fully unpacked, whereas the others
    can remain in a more compact form (such as UTF-8), and are then
    unpacked

    as they come into view (say, treating the editor window as a 32-entry
    modulo cache or similar).

    For the rest, say, one can have, say, a big buffer, with an array of
    lines giving the location and size of the line's text in the buffer.

    In a modern text editor, one can paste in {*.xls tables, *.jpg, *.gif,
    ..}
    along with text from different fonts and different backgrounds on a per character basis.


    Errm, I think we call this a word processor, not a text editor.

    Granted, text editors don't usually store font or formatting information
    in the text itself, but rather it exists temporarily for things like
    "syntax highlighting".


    If a line is modified, it can be reallocated at the end of the buffer,
    and if the buffer gets full, it can be "repacked" and/or expanded as
    needed. When written back to a file, the buffer lines can be emitted
    in-order to the text file.

    Not entirely sure how other text editors manage things here, not really

    looked into it.

    If you think about it with the above features, you quickly realize it
    is not just text anymore.


    But, word processors are their own category...

    Typically, they also have their own specialized formats (though, "big
    blob of XML inside a ZIP package" seems to have become popular).

    Whereas text-editors typically use plain ASCII/UTF-8/UTF-16 files...
    The great "feature creep" in text editors is mostly that modern ones
    support syntax highlighting and emojis.



    An intermediate option would be a wysiwyg editor that does MediaWiki or Markdown. Though, annoyingly, there don't seem to be any that exist as standalone desktop programs (seemingly invariably they are written in JavaScript or similar and intended to operate inside a browser).

    I might eventually need to get around to writing something like this
    (mostly because I use MediaWiki notation for some of my own
    documentation). Also arguably more advanced than the system used by
    "info" and "man", though a tool along these lines could make sense (but possibly as an intermediate, with an interface more like "man" but able
    to jump between documents more like "info").



    Also, bug hunt is annoying. Find/fix one bug, but more bugs remain...
    My project is seemingly in a rather buggy state right at the moment.

    But, I guess, did add things like file redirection and similar, along
    with a few more standard commands.

    So, in the working version, technically things like "cat file1 > file2"
    or "program > file" and similar are now technically possible...

    But, also, everything has turned into a crapstorm of crashes...



    - anton

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@[email protected] (MitchAlsup1) to comp.arch on Fri May 31 19:12:49 2024
    From Newsgroup: comp.arch

    BGB wrote:

    On 5/31/2024 12:21 PM, MitchAlsup1 wrote:


    For the rest, say, one can have, say, a big buffer, with an array of
    lines giving the location and size of the line's text in the buffer.

    In a modern text editor, one can paste in {*.xls tables, *.jpg, *.gif,
    ..}
    along with text from different fonts and different backgrounds on a per
    character basis.


    Errm, I think we call this a word processor, not a text editor.

    So, you are calling AOL e-mail editor a word processor ??? !!?! Gasp !
    And every modern forum editor (this one not included) word processors
    !!

    Me thinks your definition is overly inclusive.

    Granted, text editors don't usually store font or formatting
    information

    in the text itself, but rather it exists temporarily for things like
    "syntax highlighting".


    If a line is modified, it can be reallocated at the end of the buffer,
    and if the buffer gets full, it can be "repacked" and/or expanded as
    needed. When written back to a file, the buffer lines can be emitted
    in-order to the text file.

    Not entirely sure how other text editors manage things here, not really

    looked into it.

    If you think about it with the above features, you quickly realize it
    is not just text anymore.


    But, word processors are their own category...

    Typically, they also have their own specialized formats (though, "big
    blob of XML inside a ZIP package" seems to have become popular).

    Whereas text-editors typically use plain ASCII/UTF-8/UTF-16 files...
    The great "feature creep" in text editors is mostly that modern ones
    support syntax highlighting and emojis.



    An intermediate option would be a wysiwyg editor that does MediaWiki or

    Markdown. Though, annoyingly, there don't seem to be any that exist as standalone desktop programs (seemingly invariably they are written in JavaScript or similar and intended to operate inside a browser).

    I might eventually need to get around to writing something like this
    (mostly because I use MediaWiki notation for some of my own
    documentation). Also arguably more advanced than the system used by
    "info" and "man", though a tool along these lines could make sense (but

    possibly as an intermediate, with an interface more like "man" but able

    to jump between documents more like "info").



    Also, bug hunt is annoying. Find/fix one bug, but more bugs remain...
    My project is seemingly in a rather buggy state right at the moment.

    But, I guess, did add things like file redirection and similar, along
    with a few more standard commands.

    So, in the working version, technically things like "cat file1 > file2"

    or "program > file" and similar are now technically possible...

    But, also, everything has turned into a crapstorm of crashes...



    - anton
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From John Levine@[email protected] to comp.arch on Fri May 31 19:47:36 2024
    From Newsgroup: comp.arch

    According to Michael S <[email protected]>:
    U.S.-centric vs U.S. eccentric. http://www.cs.yale.edu/homes/perlis-alan/quotes.html

    Actually I am pretty sure that "eccentric" is not a fair
    characterisation of his personality, but can't resist.

    He was my thesis advisor and he was pretty eccentric. In a nice way,
    but still quite a character.
    --
    Regards,
    John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From scott@[email protected] (Scott Lurndal) to comp.arch on Fri May 31 21:01:13 2024
    From Newsgroup: comp.arch

    [email protected] (MitchAlsup1) writes:
    BGB wrote:

    On 5/31/2024 12:21 PM, MitchAlsup1 wrote:


    For the rest, say, one can have, say, a big buffer, with an array of
    lines giving the location and size of the line's text in the buffer.

    In a modern text editor, one can paste in {*.xls tables, *.jpg, *.gif,
    ..}
    along with text from different fonts and different backgrounds on a per
    character basis.


    Errm, I think we call this a word processor, not a text editor.

    So, you are calling AOL e-mail editor a word processor ???

    Yep.


    And every modern forum editor (this one not included) word processors

    Yep. They're certainly not text editors along the lines of vim or emacs.


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@[email protected] (MitchAlsup1) to comp.arch on Fri May 31 21:05:36 2024
    From Newsgroup: comp.arch

    John Levine wrote:

    According to Michael S <[email protected]>:
    U.S.-centric vs U.S. eccentric. http://www.cs.yale.edu/homes/perlis-alan/quotes.html

    Actually I am pretty sure that "eccentric" is not a fair
    characterisation of his personality, but can't resist.

    He was my thesis advisor and he was pretty eccentric. In a nice way,
    but still quite a character.


    Back in my day, eccentric was used in the British fashion to point out
    a person with certain qualities that make him instantly memorable, but
    not in any bad way. The Characters on Monty Python were eccentric !!

    Now it means a person with creepy qualities.

    My how the language has migrated.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From BGB@[email protected] to comp.arch on Fri May 31 17:34:04 2024
    From Newsgroup: comp.arch

    On 5/31/2024 4:01 PM, Scott Lurndal wrote:
    [email protected] (MitchAlsup1) writes:
    BGB wrote:

    On 5/31/2024 12:21 PM, MitchAlsup1 wrote:


    For the rest, say, one can have, say, a big buffer, with an array of lines giving the location and size of the line's text in the buffer.

    In a modern text editor, one can paste in {*.xls tables, *.jpg, *.gif, ..}
    along with text from different fonts and different backgrounds on a per character basis.


    Errm, I think we call this a word processor, not a text editor.

    So, you are calling AOL e-mail editor a word processor ???

    Yep.


    And every modern forum editor (this one not included) word processors

    Yep. They're certainly not text editors along the lines of vim or emacs.


    My definition is, say:
    Text editor:
    Notepad, Notepad2, Notepad++, GEdit, SciTe, etc...
    VI, Emacs, Nano, etc, also count.
    Line Editor:
    Ed, Edlin, etc.
    Word Processor:
    Word, {Open/Libre}Office Writer, ...
    WordPad (sorta)
    ...

    The editors in a lot of email programs or forums are HTML or Markdown
    WYSIWYG editors being used as an editor, but I would not consider them
    as text-editors when used in this context.


    About as soon as one allows things like dynamic formatting, images, and
    other metadata that can't be expressed in bare ASCII or UTF-8 or
    similar, it is no longer a text editor as I see it.

    The fuzzy line here is mostly emojis, and other effects that can be
    shoehorned though UTF-8 or similar. Because, seemingly, the era of Plain
    ASCII has mostly passed (though, it seems uncommon to use characters
    outside of ASCII or 8859-1 / 1252 range all that often; apart from
    random people sticking emojis in stuff).


    Though, IIRC, if you try sticking emojis in a lot of text editors, they
    will often render in monochrome or in non-combined forms, rather than
    the full-color fully-graphical forms often expected in things like
    messaging or chat.

    So, for example, the "family" emoji might just render as the
    man/woman/child emojis, with implicit zero-width-joiners.


    Ironically, a set already exists in certain contexts in TestKern, mostly
    for the character ranges inherited from Unifont (which apparently mostly contains the original set of ~ 200 emojis developed by NTT DoCoMo and
    similar, which exist within the BMP).

    Well, and with "quality" based on the automated algorithmic conversion
    from 16x16 1bpp bitmap graphics to SDF (sorta hit/miss).

    A different (more customized) font is used for 1252-range, mostly
    because the Unifont graphics don't work well if scaled below 16x16, and
    my strategy for the "base" characters was to design things mostly around
    an 8x8 pixel cell.

    Though, for the GUI text console and similar, I ended up going for 5x6
    padded to 6x8, which doesn't really work much outside of ASCII (and
    generally a bitmap font is used for the 6x8 and 8x8 cases; falling back
    to trying to generate cells from the SDF if accessing characters outside
    the ASCII or 1252 set, with results that are generally unreadable).


    The smallest is 3x5 padded to 4x6, but this is barely passable for ASCII
    and one needs to use their imagination for some of the character glyphs
    (so I ended up going with 5x6/6x8 instead). I suspect that 3x5 is the
    smallest size possible for semi-recognizable ASCII text.

    But, one arguable merit to 3x5 is that it does allow fitting 80x25 text characters into 320x150 pixels, or 40x25 in 160x150 (roughly the same as
    the screen on the original GameBoy).

    --- Synchronet 3.20a-Linux NewsLink 1.114