• Re: Unicode in strings

    From Stefan Monnier@[email protected] to comp.arch on Tue May 14 12:24:31 2024
    From Newsgroup: comp.arch

    Assume you're implementing a language which has a function of setting
    an individual character in a string.
    That's a design mistake in the language, and I know no language that
    has this misfeature.

    I suspect "individual character" meant "code point" above.
    Does Unicode even have the notion of "character", really?

    Instead, what we see is one language (Python3) that has an even worse misfeature: You can set an individual code point in a string; see
    above for the things you get when you overwrite code points.

    I think it's fairly common for languages that started with strings
    as "arrays of 8bit chars".

    Emacs Lisp has this misfeature as well (and so does Common Lisp). 🙁
    It's really hard to get rid of it, even though it's used *very* rarely.
    In ELisp, strings are represented internally as utf-8 (tho it pretends
    to be an array of code points), so an assignment that replaces a single
    char can require reallocating the array!

    But why would one want to set individual code points?

    Because you know your string only contains "characters" made of a single
    code point?

    E.g. your string contains the representation of the border of a table
    (to be displayed in a tty), and you want to "move" the `+` of a column separator (or a prettier version that takes advantage of the wider
    choice offered by Unicode).


    Stefan
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@[email protected] (MitchAlsup1) to comp.arch on Tue May 14 17:43:43 2024
    From Newsgroup: comp.arch

    Anton Ertl wrote:

    Thomas Koenig <[email protected]> writes:

    E.g., consider the following Gforth code (others can tell you how to
    do it in Python):

    "Ko\u0308nig" cr type

    The output is:

    König

    That is, the second character consists of two Unicode code points, the
    "o" and the "\u0308" (Combining Diaeresis).

    (I think that somewhere along the way from the Forth system to the
    xterm through copying and pasting into Emacs the second character has
    become precomposed, but that's probably just as well, so you can see
    what I see).

    If I replace the third code point with an e, I get "Koenig". So by overwriting one code point, I insert a character into the string.

    If instead I replace the second code point with a "\u0316" (Combining
    Grave Accent Below):

    "K\u0316\u0308nig" cr type

    I get this (which looks as expected in my xterm, but not in Emacs)

    K̖̈nig

    The first character is now a K with a diaeresis above and an accent
    grave below and there are now a total of 4 characters, but still 6
    code points in the string; the second character has been deleted by
    this code-point replacement.
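
    The same experiment can be sketched in Python for comparison (a rough
    equivalent only; Python's str is immutable, so each "replacement" below
    builds a new string by slicing rather than storing into the old one):

        s = "Ko\u0308nig"            # 6 code points, 5 user-perceived characters
        print(len(s), s)             # prints: 6 König

        # Replace the third code point (the combining diaeresis) with an 'e':
        print(s[:2] + "e" + s[3:])   # Koenig - overwriting one code point
                                     # inserted a character

        # Replace the second code point (the 'o') with U+0316 instead:
        t = s[:1] + "\u0316" + s[2:]
        print(len(t), t)             # 6 K̖̈nig - still 6 code points,
                                     # but only 4 characters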


    It seems to me (in my vast ignorance) that names for things should be
    written in the most appropriate set of characters in the language of
    the person/thing being named.

    Then when such a name is "sent out to be displayed" that it is a property
    of the display what character set(s) it can properly emit, and thereby
    alter the string of characters as appropriate to its capabilities.

    For example:: Take > "K\u0316\u0308nig" cr type ==> K̖̈nig
    When displayed on an ASCII-only line printer it would be written Koenig
    When displayed on an enhanced ASCII printer it would be written König
    When displayed on a fully functional printer it would be written K̖̈nig

    The problem is the mapping function between how it should be encoded
    in its own native language to what can be expressed on a particular
    device.

    Only the display device needs to understand this mapping and NOT the program/software/device holding the string.

    I think people in Japan should be able to use printf by using プリントフ.
    There is way too much "English" in the way computers are being used.
    It is similar to Anthropomorphizing animal behavior.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From David Brown@[email protected] to comp.arch on Tue May 14 20:35:37 2024
    From Newsgroup: comp.arch

    On 14/05/2024 19:43, MitchAlsup1 wrote:

    I think people in Japan should be able to use printf by using プリントフ.
    There is way too much "English" in the way computers are being used.

    I disagree entirely here.

    For many things, international consistency is more important than
    picking local-sounding names for things that have no localised meaning.
    Having a Japanese name and spelling for "printf" doesn't give Japanese
    programmers any useful information; it is not easier to type or read,
    and simply ensures that they can't cooperate and collaborate with
    programmers using different languages. MS Office uses local languages
    for its macros and formulas in Excel - I've never heard anyone in Norway
    say they like it, and many say it is a PITA that makes it hard to
    work with and hard to search for information. Most people IME who use
    macros a lot prefer to stick to English.

    It works the other way too. When discussing Karate or Judo, most practitioners the world over know what a "mawashi geri" or an "o soto
    gari" is - most consistently use the Japanese terms regardless of native languages. Most, that is, except Americans and some other English
    speakers who feel they have to use English language terms, losing a lot
    of the subtlety and nuances of the terms and being different from their international peers.

    And when people try to force localisation of terms that have no local
    words, the result is just to encourage people to move everything over to
    a single language (English).


    It is similar to Anthropomorphizing animal behavior.

    No, it is not.

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Thomas Koenig@[email protected] to comp.arch on Tue May 14 20:47:12 2024
    From Newsgroup: comp.arch

    MitchAlsup1 <[email protected]> schrieb:

    I think people in Japan should be able to use printf by using プリントフ

    I have to put up with a minor version of that - Microsoft decided to
    localize folder names ("Program files" is displayed as "Programme"
    if you use German settings, except when you access it via the
    command line), and all Excel functions are localized; depending
    if you use English or German versions, arguments are separated
    via comma or semicolon. Of course, the other way is a syntax error.

    Saving things in native Excel format is OK, but generating a CSV
    file from a program will either work or not, depending on locale
    ("," vs ";" and "." vs ".").

    This is about as annoying as it gets...
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From anton@[email protected] (Anton Ertl) to comp.arch on Sat May 18 05:29:20 2024
    From Newsgroup: comp.arch

    Stefan Monnier <[email protected]> writes:
    [Anton Ertl:]
    [Thomas Koenig:]
    Assume you're implementing a language which has a function of setting
    an individual character in a string.
    That's a design mistake in the language, and I know no language that
    has this misfeature.

    I suspect "individual character" meant "code point" above.

    I meant character, not code point, as should have become clear from
    the following. I think that Thomas Koenig meant "character", too, but
    he may have been unaware of the difference between "character" and
    "Unicode code point".

    Does Unicode even have the notion of "character", really?

    AFAIK it does not. But applications like palindrome checkers care
    about characters, not code points.
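
    A quick Python sketch of why code points are not enough here: a
    code-point-wise palindrome test already fails on a single decomposed
    character, because reversing the code points detaches the combining
    mark from its base:

        s = "o\u0308"        # the single character "ö" as two code points
        s == s[::-1]         # False: the one-character string is rejected,
                             # since the reversal is "\u0308o", not "o\u0308"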

    OTOH, most code can be implemented fine as working on strings, without
    knowing how many characters there are in the string (and it then does
    not need to know about code points, either). In other words, it can
    be implemented just as well when the strings are represented as
    strings of code units (whether UTF-8 (bytes), UTF-16 (16-bit code
    units) or UTF-32 (32-bit code units)), and then it does not help to
    convert UTF-8 to something else on input and something else to UTF-8
    on output.

    For the code that cares about characters, if it wants to work
    correctly for characters that cannot be precomposed into a single code
    point, it has to deal with characters that consist of multiple code
    points, i.e., that even in UTF-32 are variable-width. So given that
    you have to bite the variable-width bullet anyway, you can just as
    well use UTF-8.
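
    For the decomposed K̖̈nig above, a small Python sketch of the code-unit
    counts in the three encoding forms shows the point:

        s = "K\u0316\u0308nig"
        len(s.encode("utf-8"))            # 8 UTF-8 code units (bytes)
        len(s.encode("utf-16-le")) // 2   # 6 UTF-16 code units
        len(s.encode("utf-32-le")) // 4   # 6 UTF-32 code units (= code points)
        # The string has 4 user-perceived characters; no fixed-width encoding
        # form gives you that number by simple indexing.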

    Instead, what we see is one language (Python3) that has an even worse
    misfeature: You can set an individual code point in a string; see
    above for the things you get when you overwrite code points.

    I think it's fairly common for languages that started with strings
    as "arrays of 8bit chars".

    Apart from Python3, not in those languages that I have looked at more
    closely wrt this feature.

    In particular, C was created by adding a byte type to B, and that type
    was called "char". It was allowed to be wider to cater for
    word-addressed machines, but on byte-addressed machines "char" is
    invariably a byte. To cater to Unicode, they used a two-pronged
    approach: they added wchar_t and multi-byte functions (IIRC both
    already in C89); wchar_t was obviously introduced to cater for the
    upcoming Unicode 1.0 (which satisfied code unit=code point=character),
    while the multibyte stuff was probably introduced originally for
    dealing with the ASCII-compatible East-Asian encodings.

    When UTF-8 arrived, the multi-byte functions proved to fit that well;
    but of course there is not much usage of those functions, because most
    code works fine without knowing about individual code points or
    characters. And UTF-8 turned out to be the answer to dealing with
    Unicode that the Unix programmers who had a lot of code working with
    strings of chars (i.e., bytes) were looking for.

    Then Unicode 2.0 arrived and the Win32 API (which had embraced wchar_t
    and defined it as being 16-bit) stuck with 16-bit wchar_t, which
    breaks "code unit=code point"; this may not be in line with the
    intentions of the inventors of wchar_t (e.g., there are no
    multi-wchar_t functions in the C standard last time I looked), but
    that has been the existing practice in wchar_t use in C for more than
    a quarter-century.

    Unix, where wchar_t was (and still is) little used, switched to 32-bit
    wchar_t, but

    1) given that Unicode at some point (probably already in 2.0) broke
    "code point=character", that does not really help software like
    palindrome checkers.

    2) wchar_t is little-used in Unix-specific code.

    3) Code that wants to be portable between Unix and Windows and uses
    wchar_t cannot rely on "code unit=code point" anyway.

    So, in practice, C code does not make use of the ability to set an
    individual code point by overwriting a fixed-size code unit.

    Forth has chars that are 8 bits wide in traditional Forth systems on
    byte-addressed machines. The 1994 standard (in the middle of the
    reign of Unicode 1.0, and with lots of Californians on the
    standardization committee) provided the option to implement Forth
    systems with chars that take a fixed number >1 of bytes, and one
    system (JaxForth by Jack Woehr for Windows NT) implemented 16-bit
    chars.

    However, JaxForth was not very popular, and most code assumed that 1
    char = 1 (i.e., 8 bits on a byte-addressed machine), and given that
    there was no widely available system that deviated from that, even
    code that wanted to avoid this assumption could not be tested. And
    given that most code has this assumption and would not work on systems
    with 1 chars > 1, all the other systems stuck with 1 char = 1. A Chicken-and-Egg problem? Not really:

    When we looked at the problem in 2004, we found that most code works
    fine with UTF-8; that's because most code does not care about
    characters. Even code that uses words like C@ (load a char from
    memory) typically does it in a way that works with UTF-8. We proposed
    a number of words for dealing with variable-width xchars (what C calls multi-byte characters), and you can theoretically use them with the
    pre-Unicode East-Asian encodings as well as with UTF-8. These words
    were standardized in Forth-2012, but they are actually little-used
    (including by me), because most code actually works fine with opaque
    strings.

    In Gforth, an xchar is a code point, not a character, so these words
    are currently less useful for writing palindrome checkers than one
    might hope. Maybe at some point we will look at the problem again,
    and provide words for dealing with characters, Unicode normalization,
    collating order and such things, but for now the pain is not big
    enough to tackle that problem.

    Finally, I proposed to standardize the common practice 1 chars = 1;
    this proposal was accepted for standardization in 2016.

    Emacs Lisp has this misfeature as well (and so does Common Lisp). 🙁
    It's really hard to get rid of it, even though it's used *very* rarely.
    In ELisp, strings are represented internally as utf-8 (tho it pretends
    to be an array of code points), so an assignment that replaces a single
    char can require reallocating the array!

    One way forward might be to also provide a string-oriented API with
    byte (code unit) indices, and recommend that people use that instead
    of the inefficient code-point-indexed API. For a high-level language
    like Elisp or Python, the internal representation can depend on which
    function was last used on the string. So if code uses only the
    string-oriented API, you may be able to avoid the costs of the
    code-point API completely.

    But why would one want to set individual code points?

    Because you know your string only contains "characters" made of a single
    code point?

    This incorrect "knowledge" may be the reason why Emacs 27.1 displays

    K̖̈nig

    as if the first three-code-point character actually was three characters.

    E.g. your string contains the representation of the border of a table
    (to be displayed in a tty), and you want to "move" the `+` of a column separator (or a prettier version that takes advantage of the wider
    choice offered by Unicode).

    These kinds of things involve additional complications. Not only do
    you have to know the difference between code points and characters,
    you also have to know the visual width of a character, which is 0-2 for fixed-width fonts to be used in xterm or the like. Actually, if you
    treat a combining mark as having width 0, you may be able to work with
    code points and do not need characters.
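
    A rough Python sketch of that width computation, using only the
    standard unicodedata module (real terminals and fonts have further
    quirks, so treat this as an approximation):

        import unicodedata

        def display_width(s):
            # Rough column width for a fixed-width tty font: combining
            # marks take 0 columns, East Asian wide/fullwidth characters
            # take 2, everything else takes 1.
            w = 0
            for ch in s:
                if unicodedata.combining(ch):
                    continue
                w += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
            return w

        display_width("K\u0316\u0308nig")   # -> 4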

    Why do you want to move the column separator and what do you want to
    overwrite with it? This is likely the result of another operation,
    and maybe that involves another string replacement; and displaying the
    result involves so much overhead that using a string replacement
    instead of a fixed-width store is probably not the dominant cost. And
    if the replacement string happens to have as many bytes as the
    replaced string (which would happen for, e.g., replacing " " with
    "+"), the operation is not so expensive anyway.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Thomas Koenig@[email protected] to comp.arch on Sat May 18 08:29:12 2024
    From Newsgroup: comp.arch

    Anton Ertl <[email protected]> schrieb:
    Stefan Monnier <[email protected]> writes:

    Does Unicode even have the notion of "character", really?

    AFAIK it does not. But applications like palindrome checkers care
    about characters, not code points.

    Considering the huge market for palindrome checkers, that is a
    real concern, especially if they involve characters for which
    UTF-32 is not sufficient, such as smileys.

    Is there any language whose characters cannot be represented in
    UTF-32?
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From anton@[email protected] (Anton Ertl) to comp.arch on Sat May 18 08:40:40 2024
    From Newsgroup: comp.arch

    [email protected] (MitchAlsup1) writes:
    It seems to me (in my vast ignorance) that names for things should be
    written in the most appropriate set of characters in the language of
    the person/thing being named.

    Then when such a name is "sent out to be displayed" that it is a property
    of the display what character set(s) it can properly emit, and thereby
    alter the string of characters as appropriate to its capabilities.

    For example:: Take > "K\u0316\u0308nig" cr type ==> K̖̈nig
    When displayed on an ASCII-only line printer it would be written Koenig
    When displayed on an enhanced ASCII printer it would be written König
    When displayed on a fully functional printer it would be written K̖̈nig

    Why do you think that K̖̈nig should be written as Koenig or König?

    However, for König Unicode specifies that the precomposed form is
    König. And if you want a transcription into ASCII with the knowledge
    that it's German, the result would be Koenig.

    Only the display device needs to understand this mapping and NOT the program/software/device holding the string.

    Yes, that's why treating string data as opaque works for most of the
    code.

    I think people in Japan should be able to use printf by using プリントフ.
    There is way too much "English" in the way computers are being used.

    I don't know how Japanese feel about that, but I certainly don't want
    to have to use some Germanized form of C or Forth. This kind of
    catering for different natural-language programmers has been tried and
    has not taken over the world. I guess that's because

    1) You need to learn a lot about what "printf" means and how it is
    used; remembering the name is only a minor aspect.

    2) Having a name common all over the world allows you to read programs
    from all over the world, use reference material from all over the
    world, etc.

    A similar concept was implemented in COBOL, where the designers thought
    that having to write

    ADD A TO B GIVING C

    or somesuch makes programming easier than writing

    C = A+B

    in FORTRAN. Has not found many followers, either. Interestingly,
    among the Algol descendants, the BCPL (and later B and C) syntax,
    which, e.g., replaced 'or' with || or |, and was otherwise more
    symbolic and less natural-language-oriented than its ancestor Algol
    60, was the most successful syntax style among the Algol descendants,
    including spreading to languages like Java that are closer to Algol 60
    or Pascal in other respects.

    I have seen programmers define their own names based on their native
    language, however. But if they use names in their own language, these
    names should not depend on the environment.

    In the macro language of a game I play, you can refer to things
    through their name or through their numeric id. Unfortunately, the
    names are localized, so the only way to write portable macros is by
    using the unmnemonic numeric ids:-(.

    What is more common than localized programming languages is producing
    error messages in localized languages. I find this annoying, too,
    because it makes it harder to find out how others have solved the same
    problem.

    And, e.g., ENOTSUP in Unix has such a specific meaning that the
    localized text does not help the person unfamiliar with Unix, while it
    makes life harder for people who know Unix enough to make sense of the
    message; i.e., even though my native language is German, I find
    "Operation not supported" easier to understand than "Operation wird
    nicht unterstützt"; in the latter case I first have to guess what the
    English error message would have been and then I can start analysing
    the problem.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From anton@[email protected] (Anton Ertl) to comp.arch on Sat May 18 10:14:44 2024
    From Newsgroup: comp.arch

    Thomas Koenig <[email protected]> writes:
    Anton Ertl <[email protected]> schrieb:
    Stefan Monnier <[email protected]> writes:

    Does Unicode even have the notion of "character", really?

    AFAIK it does not. But applications like palindrome checkers care
    about characters, not code points.

    Considering the huge market for palindrome checkers, that is a
    real concern, especially if they involve characters for which
    UTF-32 is not sufficient, such as smileys.

    Is there any language whose characters cannot be represented in
    UTF-32?

    The goal of Unicode is to support all writing systems; AFAIK they are
    not yet finished, but they expect that these writing systems will all
    fit into the space provided by UTF-16 (i.e., a little over one million
    code points), but they found it necessary to introduce the concept of
    composing glyphs from multiple code points.

    So if your question is: "Is there any language where one character
    cannot be represented by a single Unicode code point?" The answer is
    that the Unicode designers certainly expect that there are such
    writing systems.

    And looking at <https://en.wikipedia.org/wiki/Telugu_script> (just an
    example), I see that the table of Unicode code points for Telugu <https://en.wikipedia.org/wiki/Telugu_script#Unicode> is much smaller
    than the tables of glyphs in <https://en.wikipedia.org/wiki/Telugu_script#Articulation_of_consonants>
    and <https://en.wikipedia.org/wiki/Telugu_script#Consonants_with_vowel_diacritics>, so the Telugu script seems to be one writing system that cannot be
    represented with only precomposed characters.

    I don't know if palindromes are a thing in Telugu, though.

    But, as your reference to the size of the market for palindrome
    checkers indicates, there is actually little code where dealing with
    individual characters is relevant. For code where individual
    characters are not relevant and opaque strings are sufficient, there
    is no reason to use UTF-32. And for code where individual characters
    are relevant, code points are not sufficient in general, so there is
    no reason to use UTF-32 for that, either.

    Interestingly, Emacs 27.1 manages to deal with "తెలుగు లిపి" (which
    contains 6 characters composed of a total of 11 code points) just
    fine, while it fails on König (with a decomposed Umlaut-o).

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Thomas Koenig@[email protected] to comp.arch on Sat May 18 14:09:31 2024
    From Newsgroup: comp.arch

    Anton Ertl <[email protected]> schrieb:
    [email protected] (MitchAlsup1) writes:
    It seems to me (in my vast ignorance) that names for things should be written in the most appropriate set of characters in the language of
    the person/thing being named.

    Then when such a name is "sent out to be displayed" that it is a property of the display what character set(s) it can properly emit, and thereby alter the string of characters as appropriate to its capabilities.

    For example:: Take > "K\u0316\u0308nig" cr type ==> K̖̈nig
    When displayed on an ASCII-only line printer it would be written Koenig
    When displayed on an enhanced ASCII printer it would be written König
    When displayed on a fully functional printer it would be written K̖̈nig

    Why do you think that K̖̈nig should be written as Koenig or König?

    On my display, this read K, n with a diacritic and something close to
    a cedille under the n.


    However, for König

    Again, the diaeresis is over the n, not the o.

    Unicode specifies that the precomposed form is
    König. And if you want a transcription into ASCII with the knowledge
    that it's German, the result would be Koenig.

    This is actually sometimes a (fairly minor) problem because the
    name on my passport actually reads "König" (o-diacritic), but
    people without knowledge of German tend to transcribe this as
    "Konig", whereas I transcribe it as "Koenig" on official forms
    such as the one I need to fill out prior to entering the US.

    This is why modern EU passports have a canonical form of the
    name, which then is "KOENIG".
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Terje Mathisen@[email protected] to comp.arch on Sat May 18 16:25:54 2024
    From Newsgroup: comp.arch

    Thomas Koenig wrote:
    Anton Ertl <[email protected]> schrieb:
    [email protected] (MitchAlsup1) writes:
    It seems to me (in my vast ignorance) that names for things should be written in the most appropriate set of characters in the language of
    the person/thing being named.

    Then when such a name is "sent out to be displayed" that it is a property of the display what character set(s) it can properly emit, and thereby
    alter the string of characters as appropriate to its capabilities.

    For example:: Take > "K\u0316\u0308nig" cr type ==> K̖̈nig
    When displayed on an ASCII-only line printer it would be written Koenig
    When displayed on an enhanced ASCII printer it would be written König
    When displayed on a fully functional printer it would be written K̖̈nig

    Why do you think that K̖̈nig should be written as Koenig or König?

    On my display, this read K, n with a diacritic and something close to
    a cedille under the n.


    However, for König

    Again, the diaeresis is over the n, not the o.

    Unicode specifies that the precomposed form is
    König. And if you want a transcription into ASCII with the knowledge
    that it's German, the result would be Koenig.

    This is actually sometimes a (fairly minor) problem because the
    name on my passport actually reads "König" (o-diacritic), but
    people without knowledge of German tend to transcribe this as
    "Konig", whereas I transcribe it as "Koenig" on official forms
    such as the one I need to fill out prior to entering the US.

    This is why modern EU passports have a canonical form of the
    name, which then is "KOENIG".

    Same problem as my wife and kids who have Norløff either as part of their surname or (my wife) as-is.
    Canonical simplification of the 'ø' character is either 'o' or 'oe', and passports and airline tickets differ, something which can cause all
    sorts of issues with US passport control.
    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Thomas Koenig@[email protected] to comp.arch on Sat May 18 14:41:04 2024
    From Newsgroup: comp.arch

    Terje Mathisen <[email protected]> schrieb:

    Canonical simplification of the 'ø' character is either 'o' or 'oe', and passports and airline tickets differ, something which can cause all
    sorts of issues with US passport control.

    Reminds me of either "Asterix and the Great Crossing" or "Asterix
    and the Normans", where Viking speach was indicated by having
    slashes through letters (like ø). When Obelix tries to speak
    their language, he also applies slashes, but does so randomly
    (like through a c) so nobody can understand him.

    Hmm... a challenge, can this be represented as Unicode codepoints?
    I would not be surprised if some Asterix fan had snuck it in while
    nobody was looking.

    (For those who don't know Asterix: It is a comic that was/is wildly
    popular in France and Germany at least, about Gauls who keep on
    resisting Roman occupation in the times of Julius Caesar, aided
    by a magic potion which gives them superhuman strength.)
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From anton@[email protected] (Anton Ertl) to comp.arch on Sat May 18 15:43:05 2024
    From Newsgroup: comp.arch

    Thomas Koenig <[email protected]> writes:
    Anton Ertl <[email protected]> schrieb:
    Why do you think that K̖̈nig should be written as Koenig or König?

    On my display, this read K, n with a diacritic and something close to
    a cedille under the n.

    That displays correctly then. The "close to cedille" is an accent
    grave below.

    However, for König

    Again, the diaresis is over the n, not the o.

    That's strange, in the first case your display system composes the
    diaeresis correctly with the preceding glyph (at that point, a K with
    accent grave below), but in the o case, it incorrectly composes it
    with the next glyph.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From anton@[email protected] (Anton Ertl) to comp.arch on Sat May 18 15:48:35 2024
    From Newsgroup: comp.arch

    Thomas Koenig <[email protected]> writes:
    Terje Mathisen <[email protected]> schrieb:

    Canonical simplification of the 'ø' character is either 'o' or 'oe', and passports and airline tickets differ, something which can cause all
    sorts of issues with US passport control.

    Reminds me of either "Asterix and the Great Crossing" or "Asterix
    and the Normans", where Viking speach was indicated by having
    slashes through letters (like ø). When Obelix tries to speak
    their language, he also applies slashes, but does so randomly
    (like through a c) so nobody can understand him.

    Hmm... a challenge, can this be represented as Unicode codepoints?

    Sure. See <https://en.wikipedia.org/wiki/Bar_(diacritic)>.
    Interestingly, the Obelix character ȼ you mention above has its own
    precomposed code point U+023C (Latin Small Letter C with Stroke) and
    its own Wikipedia page: https://en.wikipedia.org/wiki/%C8%BB, but you
    can also compose it from c and the combining short solidus overlay: c̷
    (this does not display correctly on Emacs 27.1, but composes correctly
    on an xterm). There is no precomposed Latin Small Letter D with
    Stroke, but you can compose it in the same way: d̷.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From John Levine@[email protected] to comp.arch on Sat May 18 17:09:44 2024
    From Newsgroup: comp.arch

    According to Thomas Koenig <[email protected]>:
    Considering the huge market for palindrome checkers, that is a
    real concern, especially if they involve characters for which
    UTF-32 is not sufficient, such as smileys.

    Is there any language whose characters cannot be represented in
    UTF-32?

    Chinese. There is a huge backlog of obscure but real Chinese characters
    that do not have a Unicode code point. This ISO committee is slowly
    working through them. Every couple of years they approve a batch of
    several thousand of them.

    https://en.wikipedia.org/wiki/Ideographic_Research_Group
    --
    Regards,
    John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Stephen Fuld@[email protected] to comp.arch on Sat May 18 17:11:32 2024
    From Newsgroup: comp.arch

    Anton Ertl wrote:


    snip


    A similar concept was implemented in COBOL, where the designers thought
    that having to write

    ADD A TO B GIVING C

    or somesuch makes programming easier than writing

    C = A+B

    in FORTRAN.


    I would put a slightly different spin on it. I believe that the
    original COBOL was designed not so much to make programming easier, but
    to make *learning* programming (for non-programmers) easier, and
    because it was supposedly "self documenting", easier for managers, etc.
    to see how the program worked. Remember, when COBOL was developed
    (late 1950s), there weren't many programmers in existence, and it was
    felt that the "mathematical" syntax of Fortran would be too unfamiliar
    to the business people who developed the new programs to solve business problems, and who were generally not mathematicians.

    Of course, they were wrong about "self documenting", and as more people
    became programmers, the advantages of concise syntax made a big
    difference.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From John Savard@[email protected] to comp.arch on Sun May 19 15:32:49 2024
    From Newsgroup: comp.arch

    On Tue, 14 May 2024 17:43:43 +0000, [email protected] (MitchAlsup1)
    wrote:

    I think people in Japan should be able to use printf by using プリントフ.
    There is way too much "English" in the way computers are being used.
    It is similar to Anthropomorphizing animal behavior.

    One could quibble.

    If Japanese people needed to enter kana from their keyboards to write
    programs, that would be awkward; there is not yet a good way to enter
    that kind of text from a keyboard.

    However, I think your point is valid. At least in some contexts.

    Remember back in the early 8-bit days of computing, and before them,
    when schools were exposing children to PDP-8 computers?

    Children were learning to program computers in BASIC.

    Obviously, here, if children in other countries used modified versions
    of BASIC that used keywords in their own natural language, it would be
    much easier for them to get started with programming than if the
    keywords were simply arbitrary strings of letters, taken from a
    foreign language of which they may not necessarily have any knowledge.

    If Algol was supposed to be an _international_ algorithmic language,
    why weren't its keywords taken from Latin or Esperanto, instead of
    English?

    Historical note: Algol was originally called IAL; remember what JOVIAL
    stood for.

    But the objections about sharing code between countries, and the fact
    that English is so widely known in technical circles, are also true.
    It is a complicated issue, made worse by the fact that nationalism and ethnocentrism are often bad things.

    John Savard
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From John Savard@[email protected] to comp.arch on Sun May 19 15:36:45 2024
    From Newsgroup: comp.arch

    On Sat, 18 May 2024 17:11:32 -0000 (UTC), "Stephen Fuld" <[email protected]d> wrote:

    and
    because it was supposedly "self documenting", easier for managers, etc.
    to see how the program worked.

    Of course, if they designed COBOL that way, why did they include a
    statement that let you re-direct GOTO statements from elsewhere in a
    program?

    I mean, that was just *asking* for dishonest programmers to direct the
    odd pennies into their bank accounts and so on.

    John Savard
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From anton@[email protected] (Anton Ertl) to comp.arch on Mon May 20 11:46:20 2024
    From Newsgroup: comp.arch

    John Savard <[email protected]d> writes:
    Remember back in the early 8-bit days of computing, and before them,
    when schools were exposing children to PDP-8 computers?

    Children were learning to program computers in BASIC.

    Obviously, here, if children in other countries used modified versions
    of BASIC that used keywords in their own natural language, it would be
    much easier for them to get started with programming than if the
    keywords were simply arbitrary strings of letters, taken from a
    foreign language of which they may not necessarily have any knowledge.

    Logo came in versions for different native languages, but looking at <https://de.wikipedia.org/wiki/Logo_(Programmiersprache)>, it shows
    English Logo examples before German Logo examples. I tried Logo on my
    C64; I don't know whether it was in English or German, but in any case
    I was not particularly impressed.

    The C64 as well as many other home computers came with BASIC, and
    BASIC was widely used, and before today I never heard or read any
    suggestion to use native-language commands in BASIC.

    I have seen some suggestions to provide native-language versions of
    Forth, but they never went anywhere (if they were serious). The main motivation here seems to have been that it's easy to do that in Forth,
    so is there a nail to which we can apply this hammer? I attend
    German-language Forth events where some of the participants are not
    good enough at English to, e.g., read articles about Forth in English,
    but none of them has Germanized his personal Forth system.

    Scratch is also designed for children and supports native-language
    switching, which eliminates one of the drawbacks of native-language
    versions.

    Like Logo, Scratch comes out of the MIT, and I wonder if the idea that programmers have problems with names that are not in their native
    language is due to their American background.

    If Algol was supposed to be an _international_ algorithmic language,
    why weren't its keywords taken from Latin or Esperanto, instead of
    English?

    Algol 60 does not standardize a program representation in characters
    (a grave mistake fixed by most later programming languages). It
    also does not standardize reserved words (aka keywords); instead, it
    has symbols that are typically written in bold in publications to
    differentiate them from identifiers written in a normal typeface.

    It is up to the compiler implementor how the programmer has to provide
    these symbols; one way is to surround each such symbol with single
    quotes (used in ICT 1900 Algol). A compiler implementor could instead
    (or in addition) support native-language representations of these
    symbols, but I am not aware that this has happened. After all, it's
    an international language, not a national language; or maybe such
    attempts were made and sunk without much notice, for the same reasons
    we have been discussing all along.

    Elliott 803 Algol uses the reserved-word approach, which means that
    programs that use, e.g., "if" as an identifier don't work, but has the
    advantage that you don't need to put that many single quotes in the
    code. This is the approach that won in later programming languages,
    but it makes it hard to introduce new reserved words in later versions
    (they may conflict with existing programs).

    As for why the Algol standard was written in English and used names
    from English rather than from Latin, that's because Algol was designed
    in 1960 when English was the lingua franca among scholars, not before
    ~1700 when Latin served that role. And Esperanto never reached that
    status.

    But concerning Latin, at the last EuroForth conference (near Rome)
    Ulrich Hoffmann gave an amusing talk where he presented a Latinized
    Forth complete with Roman numerals. Unfortunately, that talk is not
    (yet?) online.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@[email protected] (MitchAlsup1) to comp.arch on Mon May 20 17:44:48 2024
    From Newsgroup: comp.arch

    John Savard wrote:


    Historical note: Algol was originally called IAL; remember what JOVIAL
    stood for.

    Who was Joe ?? in Jovial
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Stephen Fuld@[email protected] to comp.arch on Mon May 20 19:26:39 2024
    From Newsgroup: comp.arch

    MitchAlsup1 wrote:

    John Savard wrote:


    Historical note: Algol was originally called IAL; remember what
    JOVIAL stood for.

    Who was Joe ?? in Jovial


    Just in case you weren't joking,

    Jules Own Version of the International Algorithmic Language

    Jules was Jules Schwartz

    https://en.wikipedia.org/wiki/Jules_Schwartz
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From John Savard@[email protected] to comp.arch on Wed May 22 02:16:21 2024
    From Newsgroup: comp.arch

    On Mon, 20 May 2024 19:26:39 -0000 (UTC), "Stephen Fuld" <[email protected]d> wrote:

    MitchAlsup1 wrote:

    John Savard wrote:


    Historical note: Algol was originally called IAL; remember what
    JOVIAL stood for.

    Who was Joe ?? in Jovial


    Just in case you weren't joking,

    Jules Own Version of the International Algorithmic Language

    Jules was Jules Schwartz

    https://en.wikipedia.org/wiki/Jules_Schwartz

    Not to be confused with Julius Schwartz.

    https://en.wikipedia.org/wiki/Julius_Schwartz

    John Savard
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Stefan Monnier@[email protected] to comp.arch on Wed May 22 15:38:51 2024
    From Newsgroup: comp.arch

    Assume you're implementing a language which has a function of setting
    an individual character in a string.
    That's a design mistake in the language, and I know no language that
    has this misfeature.
    I suspect "individual character" meant "code point" above.
    I meant character, not code point, as should have become clear from
    the following. I think that Thomas Koenig meant "character", too, but
    he may have been unaware of the difference between "character" and
    "Unicode code point".

    I don't know of any language (or even library) that supports the notion
    of "character" for Unicode strings. 🙁

    OTOH, most code can be implemented fine as working on strings, without knowing how many characters there are in the string (and it then does
    not need to know about code points, either).

    Indeed, most operations on strings are conversion of things to strings, concatenation of strings, search (typically for a substring or a regexp), extraction of substring where the boundaries result from an earlier
    search, and parsing (which at the bottom relies often on some sort of
    regexp or equivalent system).

    All of those work just fine on a UTF-8 sequence of bytes.
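
    In Python, for instance, all of these work directly on the encoded
    bytes, without ever decoding to code points (a sketch):

        data = "తెలుగు లిపి, König".encode("utf-8")   # opaque UTF-8 bytes

        data + b" -- " + "more König".encode("utf-8") # concatenation
        i = data.find("König".encode("utf-8"))        # search gives a byte index
        data[i:]                                       # extraction at a boundary
                                                       # found by the search
        import re
        re.split(rb",\s*", data)                       # regexp-based splitting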

    Emacs Lisp has this misfeature as well (and so does Common Lisp). 🙁
    It's really hard to get rid of it, even though it's used *very* rarely.
    In ELisp, strings are represented internally as utf-8 (tho it pretends
    to be an array of code points), so an assignment that replaces a single
    char can require reallocating the array!
    One way forward might be to also provide a string-oriented API with
    byte (code unit) indices, and recommend that people use that instead
    of the inefficient code-point-indexed API.

    I think the long term solution for ELisp will be to declare strings as basically immutable.

    Because you know your string only contains "characters" made of a single
    code point?

    This incorrect "knowledge" may be the reason why Emacs 27.1 displays

    K̖̈nig

    as if the first three-code-point character actually was three characters.

    No, the above seems like a problem in the redisplay code, and that code
    is quite aware of combining characters and stuff. You're probably
    seeing simply a missing rule to allow composition/shaping of your word.
    (the composition/shaping library operates on whole strings at a time,
    but Emacs tends to be quite conservative about the string-chunks it
    sends to that library).

    I recommend you `M-x report-emacs-bug`. The fix should be fairly simple.

    E.g. your string contains the representation of the border of a table
    (to be displayed in a tty), and you want to "move" the `+` of a column
    separator (or a prettier version that takes advantage of the wider
    choice offered by Unicode).
    These kinds of things involve additional complications.

    Very much so, indeed. It usually breaks down in many different ways
    because of the common-but-not-guaranteed assumptions.


    Stefan
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From BGB-Alt@[email protected] to comp.arch on Wed May 22 17:15:53 2024
    From Newsgroup: comp.arch

    On 5/22/2024 2:38 PM, Stefan Monnier wrote:
    Assume you're implementing a language which has a function of setting an individual character in a string.
    That's a design mistake in the language, and I know no language that
    has this misfeature.
    I suspect "individual character" meant "code point" above.
    I meant character, not code point, as should have become clear from
    the following. I think that Thomas Koenig meant "character", too, but
    he may have been unaware of the difference between "character" and
    "Unicode code point".

    I don't know of any language (or even library) that supports the notion
    of "character" for Unicode strings. 🙁


    Mostly just codepoints.

    One can take their pick between UTF-16 and UTF-32, but on average UTF-16
    uses less memory than UTF-32.

    Then there is a schism between the worlds of 16 or 32 bit wchar_t, ...

    OTOH, most code can be implemented fine as working on strings, without
    knowing how many characters there are in the string (and it then does
    not need to know about code points, either).

    Indeed, most operations on strings are conversion of things to strings, concatenation of strings, search (typically for a substring or a regexp), extraction of substring where the boundaries result from an earlier
    search, and parsing (which at the bottom relies often on some sort of
    regexp or equivalent system).

    All of those work just fine on a UTF-8 sequence of bytes.


    Sometimes it depends on context which is best.

    For general use in C (or in OS APIs), UTF-8 is a win.
    Likewise it is a sensible choice within a filesystem, or for file
    storage, ...

    Sometimes, one has languages that view everything as if it were UTF-16 codepoints. But, UTF-16 wastes memory in many cases. In these cases, the winning option here may end up being to use 8859-1 or 1252 (for string literals), or M-UTF-8 for external storage (UTF-8 encoded UTF-16
    strings, with NUL escape-coded as C0-80).
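
    The NUL escape is the same trick the JVM's "modified UTF-8" uses. A
    minimal Python sketch of just that part (the helper name is only for
    illustration, and the CESU-8-style handling of code points above
    U+FFFF that a full M-UTF-8 encoder would also need is omitted):

        def mutf8_encode(s: str) -> bytes:
            # Like UTF-8, except U+0000 becomes the overlong pair C0 80,
            # so the result never contains a raw NUL byte.
            out = bytearray()
            for ch in s:
                if ch == "\x00":
                    out += b"\xc0\x80"
                else:
                    out += ch.encode("utf-8")
            return bytes(out)

        mutf8_encode("a\x00b")   # -> b'a\xc0\x80b'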


    To some extent, my TestKern sub-project is using a hacked version of
    Unicode:
    Text is typically stored (and transmitted to/from OS APIs) using M-UTF-8; Generally, 0080..009F are interpreted as in 1252 (printable characters
    rather than extended control codes);
    0400..04FF are interpreted as "dense hexadecimal" or "inline raw data"
    rather than Arabic in certain contexts (*1), ...


    *1: Though, this shouldn't break much, as the contexts where dense
    hexadecimal or raw data would exist are likely mutually exclusive from
    those that would need Arabic (and failing this could probably use some
    extra encoding hackery).

    Well, and the extra hack that is "Double-encoded UTF-8" (used internally
    by BGBCC for u8 string literals), where parts of the space are
    repurposed (mostly to reduce codepoints needing 4-6 bytes, ...).


    Emacs Lisp has this misfeature as well (and so does Common Lisp). 🙁
    It's really hard to get rid of it, even though it's used *very* rarely.
    In ELisp, strings are represented internally as utf-8 (tho it pretends
    to be an array of code points), so an assignment that replaces a single
    char can require reallocating the array!
    One way forward might be to also provide a string-oriented API with
    byte (code unit) indices, and recommend that people use that instead
    of the inefficient code-point-indexed API.

    I think the long term solution for ELisp will be to declare strings as basically immutable.


    In general, it makes sense to regard strings as immutable. My own
    language designs had generally assumed immutable strings (if one wants a mutable string, they use an array of a character type).

    Because you know your string only contains "characters" made of a single code point?

    This incorrect "knowledge" may be the reason why Emacs 27.1 displays

    K̖̈nig

    as if the first three-code-point character actually was three characters.

    No, the above seems like a problem in the redisplay code, and that code
    is quite aware of combining characters and stuff. You're probably
    seeing simply a missing rule to allow composition/shaping of your word.
    (the composition/shaping library operates on whole strings at a time,
    but Emacs tends to be quite conservative about the string-chunks it
    sends to that library).

    I recommend you `M-x report-emacs-bug`. The fix should be fairly simple.

    E.g. your string contains the representation of the border of a table
    (to be displayed in a tty), and you want to "move" the `+` of a column
    separator (or a prettier version that takes advantage of the wider
    choice offered by Unicode).
    These kinds of things involve additional complications.

    Very much so, indeed. It usually breaks down in many different ways
    because of the common-but-not-guaranteed assumptions.


    Stefan

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From anton@[email protected] (Anton Ertl) to comp.arch on Sat May 25 15:48:07 2024
    From Newsgroup: comp.arch

    Stefan Monnier <[email protected]> writes:
    [Anton Ertl:]
    I meant character, not code point, as should have become clear from
    the following. I think that Thomas Koenig meant "character", too, but
    he may have been unaware of the difference between "character" and
    "Unicode code point".

    I don't know of any language (or even library) that supports the notion
    of "character" for Unicode strings.

    My experiments with Telugu suggest that Emacs understands the concept
    of a character at least for the Telugu script (in contrast to
    decomposed Umlauts). If I press a cursor key in Telugu text, Emacs
    advances to the next character, not the next code point. However, if
    I press DEL or BS, it deletes a code point.

    Here's some text again for playing around with it:

    తెలుగు లిపి

    Anyway, the Emacs Lisp functions right-char (and, after testing, also left-char, forward-char, and backward-char) support the notion of
    character at least for some scripts. That may be the result of an
    interaction with the redisplay code that you mention later, but in
    that case it's that code that knows about characters in Unicode.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Stephen Fuld@[email protected] to comp.arch on Sun May 26 03:50:46 2024
    From Newsgroup: comp.arch

    John Savard wrote:

    On Sat, 18 May 2024 17:11:32 -0000 (UTC), "Stephen Fuld" <[email protected]d> wrote:

    and
    because it was supposedly "self documenting", easier for managers,
    etc. to see how the program worked.

    Of course, if they designed COBOL that way, why did they include a
    statement that let you re-direct GOTO statements from elsewhere in a
    program?

    That feature (Alter GOTO) was also in Fortran, as the long since
    deprecated assigned GOTO statement. I believe they were there to
    support some older computers that didn't have indexed jump/branch
    instructions, so achieved the effect by modifying the branch
    destination in the instruction itself. And yes, it was ugly and made
    comprehension of the program, and also debugging it, much harder.


    I mean, that was just asking for dishonest programmers to direct the
    odd pennies into their bank accounts and so on.

    Not really. You had to Alter the goto statement to some pre-existing
    label, not just anywhere in the code.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Thomas Koenig@[email protected] to comp.arch on Sun May 26 08:33:50 2024
    From Newsgroup: comp.arch

    Stephen Fuld <[email protected]d> schrieb:
    John Savard wrote:

    On Sat, 18 May 2024 17:11:32 -0000 (UTC), "Stephen Fuld"
    <[email protected]d> wrote:

    and
    because it was supposedly "self documenting", easier for managers,
    etc. to see how the program worked.

    Of course, if they designed COBOL that way, why did they include a
    statement that let you re-direct GOTO statements from elsewhere in a
    program?

    That feature (Alter GOTO) was also in Fortran, as the, long since
    deprecated, assigned GOTO statement.

    Assigned GOTO is

    ASSIGN 10 to N

    GOTO N (10, 20, 30, 40)

    10 CONTINUE

    which I don't think is what John S. is describing.

    What old FORTRAN compilers had was, for debugging, an AT statement,
    which sucked control from the statement into a DEBUG section, without visibility at the place where it came from. The proverbial COME FROM statement, used as a debugging aid; in the DEBUG section, variables
    could be printed _or changed_.

    Rumor has it that the AD statement was regularly abused, so there
    were a lot of programs which did not run correctly unless debugging
    was enabled...
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Thomas Koenig@[email protected] to comp.arch on Sun May 26 10:16:27 2024
    From Newsgroup: comp.arch

    Thomas Koenig <[email protected]> schrieb:

    Rumor has it that the AD statement was regularly abused,

    s/AD/AT
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Lawrence D'Oliveiro@[email protected] to comp.arch on Mon May 27 07:34:48 2024
    From Newsgroup: comp.arch

    On Sat, 18 May 2024 05:29:20 GMT, Anton Ertl wrote:

    Stefan Monnier <[email protected]> writes:

    Does Unicode even has the notion of "character", really?

    AFAIK it does not.

    It uses terms like “grapheme” and “text element” for the concept, leaving
    “character” without a fixed meaning.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Lawrence D'Oliveiro@[email protected] to comp.arch on Mon May 27 07:40:42 2024
    From Newsgroup: comp.arch

    On Wed, 22 May 2024 15:38:51 -0400, Stefan Monnier wrote:

    I don't know of any language (or even library) that supports the notion
    of "character" for Unicode strings. 🙁

    Surely a “character” (or “grapheme” I think is (one of) the Unicode terms)
    is (represented by) a non-combining code point combined with all the immediately-following combining code points.
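
    That naive rule is easy to sketch in Python with the standard
    unicodedata module, and it handles the decomposed examples from earlier
    in the thread; as the follow-ups note, though, ZWJ sequences (and the
    full UAX #29 rules) make real grapheme segmentation more involved:

        import unicodedata

        def naive_clusters(s):
            # Group each non-combining code point with the combining code
            # points that immediately follow it.
            clusters = []
            for ch in s:
                if clusters and unicodedata.combining(ch):
                    clusters[-1] += ch
                else:
                    clusters.append(ch)
            return clusters

        naive_clusters("K\u0316\u0308nig")   # -> ['K̖̈', 'n', 'i', 'g']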
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Lawrence D'Oliveiro@[email protected] to comp.arch on Mon May 27 07:42:32 2024
    From Newsgroup: comp.arch

    On Mon, 20 May 2024 11:46:20 GMT, Anton Ertl wrote:

    Algol 60 does not standardize a program representation in characters (a
    grave mistake fixed by most later programming languages ...

    That would likely not have been considered feasible in 1960, given the
    wide variation in character sets between computer systems. Even I/O was considered to be in the too-hard basket back then.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Lawrence D'Oliveiro@[email protected] to comp.arch on Mon May 27 07:43:42 2024
    From Newsgroup: comp.arch

    On Mon, 20 May 2024 17:44:48 +0000, MitchAlsup1 wrote:

    John Savard wrote:

    Historical note: Algol was originally called IAL; remember what JOVIAL
    stood for.

    Who was Joe ?? in Jovial

    Jules Schwartz <http://bitsavers.trailing-edge.com/pdf/sdc/jovial/Schwartz_-_The_Development_of_JOVIAL_1978.pdf>
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Lawrence D'Oliveiro@[email protected] to comp.arch on Mon May 27 07:45:59 2024
    From Newsgroup: comp.arch

    On Sun, 19 May 2024 15:32:49 -0600, John Savard wrote:

    If Algol was supposed to be an _international_ algorithmic language,
    why weren't its keywords taken from Latin or Esperanto, instead of
    English?

    Much of its syntax came from mathematics, which is international.

    Semi-related question: are there non-English equivalents for mathematical operators like “grad”, “div” and “curl”?
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From John Levine@[email protected] to comp.arch on Mon May 27 15:16:13 2024
    From Newsgroup: comp.arch

    According to Lawrence D'Oliveiro <[email protected]d>:
    On Wed, 22 May 2024 15:38:51 -0400, Stefan Monnier wrote:

    I don't know of any language (or even library) that supports the notion
    of "character" for Unicode strings. 🙁

    Surely a “character” (or “grapheme” I think is (one of) the Unicode terms)
    is (represented by) a non-combining code point combined with all the immediately-following combining code points.

    Take another look at the table I referred to yesterday. When you have
    ZWJ the rules of what combines with what gets awfully complicated.
    --
    Regards,
    John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Lawrence D'Oliveiro@[email protected] to comp.arch on Tue May 28 01:08:06 2024
    From Newsgroup: comp.arch

    On Mon, 27 May 2024 15:16:13 -0000 (UTC), John Levine wrote:

    According to Lawrence D'Oliveiro <[email protected]d>:
    On Wed, 22 May 2024 15:38:51 -0400, Stefan Monnier wrote:

    I don't know of any language (or even library) that supports the
    notion of "character" for Unicode strings. 🙁

    Surely a “character” (or “grapheme” I think is (one of) the Unicode terms) is (represented by) a non-combining code point combined with all
    the immediately-following combining code points.

    Take another look at the table I referred to yesterday. When you have
    ZWJ the rules of what combines with what gets awfully complicated.

    ZWJ is classed as “punctuation”, and has no combining class. So it forms a “character” or “grapheme” in its own right.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From John Levine@[email protected] to comp.arch on Tue May 28 01:25:38 2024
    From Newsgroup: comp.arch

    According to Lawrence D'Oliveiro <[email protected]d>:
    On Mon, 27 May 2024 15:16:13 -0000 (UTC), John Levine wrote:

    According to Lawrence D'Oliveiro <[email protected]d>:
    On Wed, 22 May 2024 15:38:51 -0400, Stefan Monnier wrote:

    I don't know of any language (or even library) that supports the
    notion of "character" for Unicode strings. 🙁

    Surely a “character” (or “grapheme” I think is (one of) the Unicode terms) is (represented by) a non-combining code point combined with all
    the immediately-following combining code points.

    Take another look at the table I referred to yesterday. When you have
    ZWJ the rules of what combines with what gets awfully complicated.

    ZWJ is classed as “punctuation”, and has no combining class. So it forms a
    “character” or “grapheme” in its own right.

    Really, you need to look at that combined emoji table I told you about yesterday.
    --
    Regards,
    John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Lawrence D'Oliveiro@[email protected] to comp.arch on Tue May 28 01:29:31 2024
    From Newsgroup: comp.arch

    On Tue, 28 May 2024 01:25:38 -0000 (UTC), John Levine wrote:

    Really, you need to look at that combined emoji table I told you about yesterday.

    I’m just telling you what the official Unicode spec says.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From John Levine@[email protected] to comp.arch on Tue May 28 01:36:22 2024
    From Newsgroup: comp.arch

    It appears that Lawrence D'Oliveiro <[email protected]d> said:
    On Tue, 28 May 2024 01:25:38 -0000 (UTC), John Levine wrote:

    Really, you need to look at that combined emoji table I told you about
    yesterday.

    I’m just telling you what the official Unicode spec says.

    Um, so am I. Those nine code point things are supposed to display
    as a single little picture, regardless of what some other bit of
    the spec may assert about ZWJ.
    --
    Regards,
    John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Thomas Koenig@[email protected] to comp.arch on Tue May 28 17:04:20 2024
    From Newsgroup: comp.arch

    Lawrence D'Oliveiro <[email protected]d> schrieb:
    On Sun, 19 May 2024 15:32:49 -0600, John Savard wrote:

    If Algol was supposed to be an _international_ algorithmic language,
    why weren't its keywords taken from Latin or Esperanto, instead of
    English?

    Much of its syntax came from mathematics, which is international.

    Semi-related question: are there non-English equivalents for mathematical operators like “grad”, “div” and “curl”?

    German has "grad", "div" and "rot". People also use the nabla
    operator, which I personally don't like.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Stefan Monnier@[email protected] to comp.arch on Tue May 28 16:37:22 2024
    From Newsgroup: comp.arch

    Anyway, the Emacs Lisp functions right-char (and, after testing, also left-char, forward-char, and backward-char) support the notion of
    character at least for some scripts. That may be the result of an interaction with the redisplay code that you mention later, but in
    that case it's that code that knows about characters in Unicode.

    Indeed, the concept is somewhat visible, but it's not really exposed in
    the language. I think what you're seeing is implemented elsewhere than
    in `forward-char`, it's a part of the interactive loop which sees that
    after `forward-char` you end up "in the middle" of a composition and it
    moves the point further, based on information that mostly belongs to the redisplay code.

    Try `C-u 2 C-f` and I suspect you'll see that it doesn't always advance
    by 2 characters but rather it advances by "2 code points + rounding up
    to the next character boundary".
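    A rough Python sketch of that "round up" behaviour (the grouping rule is the naive combining-class one discussed earlier in the thread, not what Emacs actually does internally):

        import unicodedata

        def forward_rounded(s, pos, n):
            # Advance n code points, then skip combining marks so we
            # never stop in the middle of a base+marks group.
            pos = min(pos + n, len(s))
            while pos < len(s) and unicodedata.combining(s[pos]) != 0:
                pos += 1
            return pos

        s = "Ko\u0308nig"                # 6 code points, 5 "characters"
        print(forward_rounded(s, 0, 2))  # 3, not 2: position 2 was the U+0308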


    Stefan
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Stefan Monnier@[email protected] to comp.arch on Tue May 28 16:53:14 2024
    From Newsgroup: comp.arch

    Um, so am I. Those nine code point things are supposed to display
    as a single little picture, regardless of what some other bit of
    the spec may assert about ZWJ.

    Maybe it's a good time to start taking bets for which will be the year
    that Unicode becomes Turing complete?


    Stefan
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From anton@[email protected] (Anton Ertl) to comp.arch on Wed May 29 06:59:55 2024
    From Newsgroup: comp.arch

    Stefan Monnier <[email protected]> writes:
    Anyway, the Emacs Lisp functions right-char (and, after testing, also
    left-char, forward-char, and backward-char) support the notion of
    character at least for some scripts. That may be the result of an
    interaction with the redisplay code that you mention later, but in
    that case it's that code that knows about characters in Unicode.

    Indeed, the concept is somewhat visible, but it's not really exposed in
    the language. I think what you're seeing is implemented elsewhere than
    in `forward-char`, it's a part of the interactive loop which sees that
    after `forward-char` you end up "in the middle" of a composition and it
    moves the point further, based on information that mostly belongs to the redisplay code.

    Try `C-u 2 C-f` and I suspect you'll see that it doesn't always advance
    by 2 characters but rather it advances by "2 code points + rounding up
    to the next character boundary".

    Confirmed. So Emacs Lisp has a codepoint-oriented interface and then
    needs to compensate for that elsewhere. This does not indicate that a codepoint-oriented interface is a good idea, rather the opposite.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From anton@[email protected] (Anton Ertl) to comp.arch on Wed May 29 08:07:50 2024
    From Newsgroup: comp.arch

    Lawrence D'Oliveiro <[email protected]d> writes:
    On Mon, 20 May 2024 11:46:20 GMT, Anton Ertl wrote:

    Algol 60 does not standardize a program representation in characters (a
    grave mistake fixed by most later programming languages ...

    That would likely not have been considered feasible in 1960, given the
    wide variation in character sets between computer systems.

    COBOL did it. LISP did it. It was feasible in 1960. It's just that
    the Algol 60 committee did not want to go there. And the Algol 68
    committee did not want to go there even though ASCII was standardized
    in 1963, and Algol 68 was only finished in 1974 AFAIK.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Stefan Monnier@[email protected] to comp.arch on Wed May 29 10:44:21 2024
    From Newsgroup: comp.arch

    Confirmed. So Emacs Lisp has a codepoint-oriented interface and then
    needs to compensate for that elsewhere. This does not indicate that a codepoint-oriented interface is a good idea, rather the opposite.

    Note that the "round to the next character boundary" is actually
    generalized to non-Unicode concepts: you can mark a chunk of text as
    being "intangible" or make it invisible and the "round up" will
    correspondingly move to the next boundary to avoid the cursor being in
    the middle of an invisible or intangible chunk of text.

    I'm not sure the codepoint-oriented API is the best option, but it's not completely clear what *is* the best option. You mention a byte-oriented
    API and you might be right that it's a better option, but in the case of
    Emacs that's what we used in Emacs-20.1 but it worked really poorly
    because of backward compatibility issues. I think if we started from
    scratch now (i.e. without having to contend with backward compatibility,
    and with a better understanding of Unicode (which barely existed back
    then)) it might work better, indeed, but that's not been an option 🙁


    Stefan
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Lawrence D'Oliveiro@[email protected] to comp.arch on Thu May 30 02:50:33 2024
    From Newsgroup: comp.arch

    On Wed, 29 May 2024 08:07:50 GMT, Anton Ertl wrote:

    Lawrence D'Oliveiro <[email protected]d> writes:

    On Mon, 20 May 2024 11:46:20 GMT, Anton Ertl wrote:

    Algol 60 does not standardize a program representation in characters
    (a grave mistake fixed by most later programming languages ...

    That would likely not have been considered feasible in 1960, given the
    wide variation in character sets between computer systems.

    COBOL did it. LISP did it.

    And so did Fortran. They all did it by severely curtailing their allowed character sets.

    It's just that the Algol 60 committee did not want to go there.

    They wanted symbols like “÷”, “×”, “↑”, “≤”, “≥”, “≠”, “≡”, “⊃”, “∨”, “∧”,
    “¬” ... you get the idea. I don’t think any computer system on earth could provide all those symbols at the time, or even, say, 20 years later.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Stephen Fuld@[email protected] to comp.arch on Thu May 30 03:21:13 2024
    From Newsgroup: comp.arch

    Lawrence D'Oliveiro wrote:


    snip

    They wanted symbols like “÷”, “×”, “↑”, “≤”, “≥”, “≠”, “≡”, “⊃”, “∨”,
    “∧”, “¬” ... you get the idea. I don’t think any computer system on earth
    could provide all those symbols at the time, or even, say, 20 years
    later.

    See APL. So many symbols that the language is almost impossible to
    read without a significant investment in learning them.

    https://en.wikipedia.org/wiki/APL_syntax_and_symbols#Monadic_functions


    Please note that I am not advocating this. It is at the opposite end
    of the spectrum from COBOL where you could get by with no special
    characters beyond periods. Neither was a good choice.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Tim Rentsch@[email protected] to comp.arch on Wed May 29 21:47:52 2024
    From Newsgroup: comp.arch

    "Stephen Fuld" <[email protected]d> writes:

    Lawrence D'Oliveiro wrote:


    snip

    They wanted symbols like [...]

    See APL. So many symbols that the language is almost impossible to
    read without a significant investment in learning them.

    https://en.wikipedia.org/wiki/APL_syntax_and_symbols#Monadic_functions

    The problem with learning APL is not the character set. APL without
    any special characters (which I actually have some experience using)
    is still unlike any other programming language that existed in the
    1960s or 1970s.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Stephen Fuld@[email protected] to comp.arch on Thu May 30 06:12:11 2024
    From Newsgroup: comp.arch

    Tim Rentsch wrote:

    "Stephen Fuld" <[email protected]d> writes:

    Lawrence D'Oliveiro wrote:


    snip

    They wanted symbols like [...]

    See APL. So many symbols that the language is almost impossible to
    read without a significant investment in learning them.


    https://en.wikipedia.org/wiki/APL_syntax_and_symbols#Monadic_functions

    The problem with learning APL is not the character set. APL without
    any special characters (which I actually have some experience using)
    is still unlike any other programming language that existed in the
    1960s or 1970s.

    OK, but my main point was to show, by counter example, the error of
    Lawrence's statement quoted below


    I don’t think any computer system on earth could
    provide all those symbols at the time, or even, say, 20 years later.

    If the part about the difficulty of learning APL was wrong, then I
    apologise.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Tim Rentsch@[email protected] to comp.arch on Thu May 30 05:38:00 2024
    From Newsgroup: comp.arch

    "Stephen Fuld" <[email protected]d> writes:

    Tim Rentsch wrote:

    "Stephen Fuld" <[email protected]d> writes:

    Lawrence D'Oliveiro wrote:

    snip

    They wanted symbols like [...]

    See APL. So many symbols that the language is almost impossible to
    read without a significant investment in learning them.

    https://en.wikipedia.org/wiki/APL_syntax_and_symbols#Monadic_functions

    The problem with learning APL is not the character set. APL without
    any special characters (which I actually have some experience using)
    is still unlike any other programming language that existed in the
    1960s or 1970s.

    OK, but my main point was to show, by counter example, the error of Lawrence's statement quoted below

    I see. I misunderstood the point of what you were saying. Sorry
    about that.

    I don't think any computer system on earth could provide all those
    symbols at the time, or even, say, 20 years later.

    If the part about the difficulty of learning APL was wrong, then I
    apologise.

    No apology needed. Even if the APL character set wasn't the main
    source of the difficulty, there is no question that the unusual
    choice of operator characters used contributed to the effort needed
    to understand and use APL.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From anton@[email protected] (Anton Ertl) to comp.arch on Thu May 30 16:25:46 2024
    From Newsgroup: comp.arch

    Stefan Monnier <[email protected]> writes:
    I'm not sure the codepoint-oriented API is the best option, but it's not completely clear what *is* the best option. You mention a byte-oriented
    API and you might be right that it's a better option, but in the case of Emacs that's what we used in Emacs-20.1 but it worked really poorly
    because of backward compatibility issues. I think if we started from
    scratch now (i.e. without having to contend with backward compatibility,
    and with a better understanding of Unicode (which barely existed back
    then)) it might work better, indeed, but that's not been an option

    Plus, editors are among the very few uses where you have to deal with individual characters, so the "treat it as opaque string" approach
    that works so well for most other code is not good enough there. The command-line editor of Gforth is one case where we use the xchar words
    (those for dealing with code points of UTF-8).
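    As a rough illustration of what such code-point-level words have to do over UTF-8, here is a Python sketch of stepping over one encoded code point (the function name is made up; it assumes well-formed UTF-8 and that the index sits on a lead byte):

        def xchar_len(lead_byte):
            # Length in bytes of one UTF-8 sequence, judged from its lead byte.
            if lead_byte < 0x80:
                return 1   # 0xxxxxxx: plain ASCII
            if lead_byte >= 0xF0:
                return 4   # 11110xxx
            if lead_byte >= 0xE0:
                return 3   # 1110xxxx
            return 2       # 110xxxxx

        buf = "König".encode("utf-8")
        i = 0
        while i < len(buf):
            j = i + xchar_len(buf[i])
            print(buf[i:j].decode("utf-8"))
            i = j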

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Stefan Monnier@[email protected] to comp.arch on Thu May 30 14:01:53 2024
    From Newsgroup: comp.arch

    The problem with learning APL is not the character set. APL without
    any special characters (which I actually have some experience using)
    is still unlike any other programming language that existed in the
    1960s or 1970s.

    There have been a few languages that took similar approaches, but the
    most recent and successful I've heard of is [jq](https://en.wikipedia.org/wiki/Jq_%28programming_language%29).


    Stefan
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From John Savard@[email protected] to comp.arch on Thu May 30 22:19:14 2024
    From Newsgroup: comp.arch

    On Wed, 29 May 2024 08:07:50 GMT, [email protected]
    (Anton Ertl) wrote:

    Lawrence D'Oliveiro <[email protected]d> writes:
    On Mon, 20 May 2024 11:46:20 GMT, Anton Ertl wrote:

    Algol 60 does not standardize a program representation in characters (a
    grave mistake fixed by most later programming languages ...

    That would likely not have been considered feasible in 1960, given the wide variation in character sets between computer systems.

    COBOL did it. LISP did it. It was feasible in 1960. It's just that
    the Algol 60 committee did not want to go there.

    There was a famous article by Bob Bemer in 1960 in the Communications
    of the ACM in which he gave a table of all this variation in character
    sets between computers. This helped spur the adoption of ASCII.

    Algol 60 was intended as an International Algorithmic Language. In
    fact, that's what Algol was first called, hence JOVIAL. So it is _not_ particularly hard for me to believe that the international committee
    behind Algol 60 wished to support a wider variety of computers than
    the people behind COBOL and LISP. Yes, those languages, unlike
    FORTRAN, weren't the creations of a single manufacturer.

    But they _were_ fairly U.S. - centric, and Algol was *not*. For
    example, there were British computer systems that offered Algol
    compilers that based their character sets on modified 5-unit
    teleprinters.

    John Savard
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From John Savard@[email protected] to comp.arch on Thu May 30 22:22:34 2024
    From Newsgroup: comp.arch

    On Thu, 30 May 2024 02:50:33 -0000 (UTC), Lawrence D'Oliveiro
    <[email protected]d> wrote:

    And so did Fortran. They all did it by severely curtailing their allowed character sets.

    It's just that the Algol 60 committee did not want to go there.

    They wanted symbols like “÷”, “×”, “↑”, “≤”, “≥”, “≠”, “≡”, “⊃”, “∨”, “∧”, “¬” ... you get the idea. I don’t think any computer system on earth could
    provide all those symbols at the time, or even, say, 20 years later.

    Well, the 120 character chain for the STRETCH computer's printer
    handled Algol's character set. And so did the punched card code for a
    couple of Russian computers. So the attempt was made.

    And then there was the LISP machine, which started life with the
    infamous "Space Cadet" computer.

    Today, of course, we have Unicode, but that doesn't mean the entire
    Algol character set is conveniently accessible directly from the
    keyboard.

    John Savard
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From John Savard@[email protected] to comp.arch on Thu May 30 22:25:47 2024
    From Newsgroup: comp.arch

    On Thu, 30 May 2024 06:12:11 -0000 (UTC), "Stephen Fuld" <[email protected]d> wrote:

    If the part about the difficulty of learning APL was wrong, then I
    apologise.

    I would not say that it was wrong. APL "without special characters"
    was achieved by way of a transliteration scheme, where short codes
    represented the special characters. So instead of memorizing funny
    shapes, you memorized cryptic abbreviations.

    So the character set was _still_ the source of the difficulty of
    learning APL even if you happened to be using an implementation that
    didn't have any special characters.

    John Savard
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Michael S@[email protected] to comp.arch on Fri May 31 12:59:42 2024
    From Newsgroup: comp.arch

    On Thu, 30 May 2024 22:19:14 -0600
    John Savard <[email protected]d> wrote:

    But they _were_ fairly U.S. - centric, and Algol was *not*. For
    example,

    U.S.-centric vs U.S. eccentric. http://www.cs.yale.edu/homes/perlis-alan/quotes.html

    Actually I am pretty sure that "eccentric" is not a fair
    characterisation of his personality, but can't resist.

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Tim Rentsch@[email protected] to comp.arch on Fri May 31 09:47:58 2024
    From Newsgroup: comp.arch

    John Savard <[email protected]d> writes:

    On Thu, 30 May 2024 06:12:11 -0000 (UTC), "Stephen Fuld" <[email protected]d> wrote:

    If the part about the difficulty of learning APL was wrong, then I
    apologise.

    I would not say that it was wrong. APL "without special characters"
    was achieved by way of a transliteration scheme, where short codes represented the special characters. So instead of memorizing funny
    shapes, you memorized cryptic abbreviations.

    So the character set was _still_ the source of the difficulty of
    learning APL even if you happened to be using an implementation that
    didn't have any special characters.

    The character set was a source of some of the difficulty of
    learning APL. Certainly not all of it.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From BGB@[email protected] to comp.arch on Fri May 31 12:14:19 2024
    From Newsgroup: comp.arch

    On 5/30/2024 11:25 AM, Anton Ertl wrote:
    Stefan Monnier <[email protected]> writes:
    I'm not sure the codepoint-oriented API is the best option, but it's not
    completely clear what *is* the best option. You mention a byte-oriented
    API and you might be right that it's a better option, but in the case of
    Emacs that's what we used in Emacs-20.1 but it worked really poorly
    because of backward compatibility issues. I think if we started from
    scratch now (i.e. without having to contend with backward compatibility,
    and with a better understanding of Unicode (which barely existed back
    then)) it might work better, indeed, but that's not been an option

    Plus, editors are among the very few uses where you have to deal with individual characters, so the "treat it as opaque string" approach
    that works so well for most other code is not good enough there. The command-line editor of Gforth is one case where we use the xchar words
    (those for dealing with code points of UTF-8).


    Yeah.

    For text editors, this is one of the few cases it makes sense to use 32
    or 64 bit characters (say, combining the 'character' with some
    additional metadata such as formatting).

    Though, one thing that makes sense for text editors is if only the
    "currently being edited" lines are fully unpacked, whereas the others
    can remain in a more compact form (such as UTF-8), and are then unpacked
    as they come into view (say, treating the editor window as a 32-entry
    modulo cache or similar).

    For the rest, say, one can have, say, a big buffer, with an array of
    lines giving the location and size of the line's text in the buffer.

    If a line is modified, it can be reallocated at the end of the buffer,
    and if the buffer gets full, it can be "repacked" and/or expanded as
    needed. When written back to a file, the buffer lines can be emitted
    in-order to the text file.
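    A very rough Python sketch of that scheme (a hypothetical class, purely illustrative; real editors more often use gap buffers, ropes, or piece tables):

        class LineBuffer:
            # Lines are (offset, length) pairs into one big byte buffer.
            # A modified line is re-appended at the end; the old bytes
            # become garbage until the next repack.
            def __init__(self, text):
                # text: a bytes object holding the whole file
                self.buf = bytearray()
                self.lines = []
                for line in text.split(b"\n"):
                    self.lines.append((len(self.buf), len(line)))
                    self.buf += line

            def get(self, i):
                off, ln = self.lines[i]
                return bytes(self.buf[off:off + ln])

            def set(self, i, new_line):
                # Reallocate the modified line at the end of the buffer.
                self.lines[i] = (len(self.buf), len(new_line))
                self.buf += new_line

            def repack(self):
                # Copy only the live lines into a fresh buffer.
                packed, lines = bytearray(), []
                for off, ln in self.lines:
                    lines.append((len(packed), ln))
                    packed += self.buf[off:off + ln]
                self.buf, self.lines = packed, lines

            def save(self):
                # Emit the lines in order when writing back to a file.
                return b"\n".join(self.get(i) for i in range(len(self.lines)))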

    Not entirely sure how other text editors manage things here, not really
    looked into it.


    - anton

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@[email protected] (MitchAlsup1) to comp.arch on Fri May 31 17:21:53 2024
    From Newsgroup: comp.arch

    BGB wrote:

    On 5/30/2024 11:25 AM, Anton Ertl wrote:
    Stefan Monnier <[email protected]> writes:
    I'm not sure the codepoint-oriented API is the best option, but it's
    not
    completely clear what *is* the best option. You mention a
    byte-oriented
    API and you might be right that it's a better option, but in the case
    of
    Emacs that's what we used in Emacs-20.1 but it worked really poorly
    because of backward compatibility issues. I think if we started from
    scratch now (i.e. without having to contend with backward
    compatibility,
    and with a better understanding of Unicode (which barely existed back
    then)) it might work better, indeed, but that's not been an option

    Plus, editors are among the very few uses where you have to deal with
    individual characters, so the "treat it as opaque string" approach
    that works so well for most other code is not good enough there. The
    command-line editor of Gforth is one case where we use the xchar words
    (those for dealing with code points of UTF-8).


    Yeah.

    For text editors, this is one of the few cases it makes sense to use 32

    or 64 bit characters (say, combining the 'character' with some
    additional metadata such as formatting).

    Though, one thing that makes sense for text editors is if only the "currently being edited" lines are fully unpacked, whereas the others
    can remain in a more compact form (such as UTF-8), and are then
    unpacked

    as they come into view (say, treating the editor window as a 32-entry
    modulo cache or similar).

    For the rest, say, one can have, say, a big buffer, with an array of
    lines giving the location and size of the line's text in the buffer.

    In a modern text editor, one can paste in {*.xls tables, *.jpg, *.gif,
    ..}
    along with text from different fonts and different backgrounds on a per character basis.

    If a line is modified, it can be reallocated at the end of the buffer,
    and if the buffer gets full, it can be "repacked" and/or expanded as
    needed. When written back to a file, the buffer lines can be emitted in-order to the text file.

    Not entirely sure how other text editors manage things here, not really

    looked into it.

    If you think about it with the above features, you quickly realize it
    is not just text anymore.


    - anton
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From BGB@[email protected] to comp.arch on Fri May 31 12:55:59 2024
    From Newsgroup: comp.arch

    On 5/31/2024 12:21 PM, MitchAlsup1 wrote:
    BGB wrote:

    On 5/30/2024 11:25 AM, Anton Ertl wrote:
    Stefan Monnier <[email protected]> writes:
    I'm not sure the codepoint-oriented API is the best option, but it's
    not
    completely clear what *is* the best option.  You mention a
    byte-oriented
    API and you might be right that it's a better option, but in the case
    of
    Emacs that's what we used in Emacs-20.1 but it worked really poorly
    because of backward compatibility issues.  I think if we started from scratch now (i.e. without having to contend with backward
    compatibility,
    and with a better understanding of Unicode (which barely existed back
    then)) it might work better, indeed, but that's not been an option

    Plus, editors are among the very few uses where you have to deal with
    individual characters, so the "treat it as opaque string" approach
    that works so well for most other code is not good enough there.  The
    command-line editor of Gforth is one case where we use the xchar words
    (those for dealing with code points of UTF-8).


    Yeah.

    For text editors, this is one of the few cases it makes sense to use 32

    or 64 bit characters (say, combining the 'character' with some
    additional metadata such as formatting).

    Though, one thing that makes sense for text editors is if only the
    "currently being edited" lines are fully unpacked, whereas the others
    can remain in a more compact form (such as UTF-8), and are then
    unpacked

    as they come into view (say, treating the editor window as a 32-entry
    modulo cache or similar).

    For the rest, say, one can have, say, a big buffer, with an array of
    lines giving the location and size of the line's text in the buffer.

    In a modern text editor, one can paste in {*.xls tables, *.jpg, *.gif,
    ..}
    along with text from different fonts and different backgrounds on a per character basis.


    Errm, I think we call this a word processor, not a text editor.

    Granted, text editors don't usually store font or formatting information
    in the text itself, but rather it exists temporarily for things like
    "syntax highlighting".


    If a line is modified, it can be reallocated at the end of the buffer,
    and if the buffer gets full, it can be "repacked" and/or expanded as
    needed. When written back to a file, the buffer lines can be emitted
    in-order to the text file.

    Not entirely sure how other text editors manage things here, not really

    looked into it.

    If you think about it with the above features, you quickly realize it
    is not just text anymore.


    But, word processors are their own category...

    Typically, they also have their own specialized formats (though, "big
    blob of XML inside a ZIP package" seems to have become popular).

    Whereas text-editors typically use plain ASCII/UTF-8/UTF-16 files...
    The great "feature creep" in text editors is mostly that modern ones
    support syntax highlighting and emojis.



    An intermediate option would be a wysiwyg editor that does MediaWiki or Markdown. Though, annoyingly, there don't seem to be any that exist as standalone desktop programs (seemingly invariably they are written in JavaScript or similar and intended to operate inside a browser).

    I might eventually need to get around to writing something like this
    (mostly because I use MediaWiki notation for some of my own
    documentation). Also arguably more advanced than the system used by
    "info" and "man", though a tool along these lines could make sense (but possibly as an intermediate, with an interface more like "man" but able
    to jump between documents more like "info").



    Also, bug hunt is annoying. Find/fix one bug, but more bugs remain...
    My project is seemingly in a rather buggy state right at the moment.

    But, I guess, did add things like file redirection and similar, along
    with a few more standard commands.

    So, in the working version, technically things like "cat file1 > file2"
    or "program > file" and similar are now technically possible...

    But, also, everything has turned into a crapstorm of crashes...



    - anton

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@[email protected] (MitchAlsup1) to comp.arch on Fri May 31 19:12:49 2024
    From Newsgroup: comp.arch

    BGB wrote:

    On 5/31/2024 12:21 PM, MitchAlsup1 wrote:


    For the rest, say, one can have, say, a big buffer, with an array of
    lines giving the location and size of the line's text in the buffer.

    In a modern text editor, one can paste in {*.xls tables, *.jpg, *.gif,
    ..}
    along with text from different fonts and different backgrounds on a per
    character basis.


    Errm, I think we call this a word processor, not a text editor.

    So, you are calling AOL e-mail editor a word processor ??? !!?! Gasp !
    And every modern forum editor (this one not included) word processors
    !!

    Me thinks your definition is overly inclusive.

    Granted, text editors don't usually store font or formatting
    information

    in the text itself, but rather it exists temporarily for things like
    "syntax highlighting".


    If a line is modified, it can be reallocated at the end of the buffer,
    and if the buffer gets full, it can be "repacked" and/or expanded as
    needed. When written back to a file, the buffer lines can be emitted
    in-order to the text file.

    Not entirely sure how other text editors manage things here, not really

    looked into it.

    If you think about it with the above features, you quickly realize it
    is not just text anymore.


    But, word processors are their own category...

    Typically, they also have their own specialized formats (though, "big
    blob of XML inside a ZIP package" seems to have become popular).

    Whereas text-editors typically use plain ASCII/UTF-8/UTF-16 files...
    The great "feature creep" in text editors is mostly that modern ones
    support syntax highlighting and emojis.



    An intermediate option would be a wysiwyg editor that does MediaWiki or

    Markdown. Though, annoyingly, there don't seem to be any that exist as standalone desktop programs (seemingly invariably they are written in JavaScript or similar and intended to operate inside a browser).

    I might eventually need to get around to writing something like this
    (mostly because I use MediaWiki notation for some of my own
    documentation). Also arguably more advanced than the system used by
    "info" and "man", though a tool along these lines could make sense (but

    possibly as an intermediate, with an interface more like "man" but able

    to jump between documents more like "info").



    Also, bug hunt is annoying. Find/fix one bug, but more bugs remain...
    My project is seemingly in a rather buggy state right at the moment.

    But, I guess, did add things like file redirection and similar, along
    with a few more standard commands.

    So, in the working version, technically things like "cat file1 > file2"

    or "program > file" and similar are now technically possible...

    But, also, everything has turned into a crapstorm of crashes...



    - anton
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From John Levine@[email protected] to comp.arch on Fri May 31 19:47:36 2024
    From Newsgroup: comp.arch

    According to Michael S <[email protected]>:
    U.S.-centric vs U.S. eccentric. http://www.cs.yale.edu/homes/perlis-alan/quotes.html

    Actually I am pretty sure that "eccentric" is not a fair
    characterisation of his personality, but can't resist.

    He was my thesis advisor and he was pretty eccentric. In a nice way,
    but still quite a character.
    --
    Regards,
    John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From scott@[email protected] (Scott Lurndal) to comp.arch on Fri May 31 21:01:13 2024
    From Newsgroup: comp.arch

    [email protected] (MitchAlsup1) writes:
    BGB wrote:

    On 5/31/2024 12:21 PM, MitchAlsup1 wrote:


    For the rest, say, one can have, say, a big buffer, with an array of
    lines giving the location and size of the line's text in the buffer.

    In a modern text editor, one can paste in {*.xls tables, *.jpg, *.gif,
    ..}
    along with text from different fonts and different backgrounds on a per
    character basis.


    Errm, I think we call this a word processor, not a text editor.

    So, you are calling AOL e-mail editor a word processor ???

    Yep.


    And every modern forum editor (this one not included) word processors

    Yep. They're certainly not text editors along the lines of vim or emacs.


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@[email protected] (MitchAlsup1) to comp.arch on Fri May 31 21:05:36 2024
    From Newsgroup: comp.arch

    John Levine wrote:

    According to Michael S <[email protected]>:
    U.S.-centric vs U.S. eccentric. http://www.cs.yale.edu/homes/perlis-alan/quotes.html

    Actually I am pretty sure that "eccentric" is not a fair
    characterisation of his personality, but can't resist.

    He was my thesis advisor and he was pretty eccentric. In a nice way,
    but still quite a character.


    Back in my day, eccentric was used in the British fashion to point out
    a person with certain qualities that make him instantly memorable, but
    not in any bad way. The Characters on Monty Python were eccentric !!

    Now it means a person with creepy qualities.

    My how the language has migrated.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From BGB@[email protected] to comp.arch on Fri May 31 17:34:04 2024
    From Newsgroup: comp.arch

    On 5/31/2024 4:01 PM, Scott Lurndal wrote:
    [email protected] (MitchAlsup1) writes:
    BGB wrote:

    On 5/31/2024 12:21 PM, MitchAlsup1 wrote:


    For the rest, say, one can have, say, a big buffer, with an array of lines giving the location and size of the line's text in the buffer.

    In a modern text editor, one can paste in {*.xls tables, *.jpg, *.gif, ..}
    along with text from different fonts and different backgrounds on a per character basis.


    Errm, I think we call this a word processor, not a text editor.

    So, you are calling AOL e-mail editor a word processor ???

    Yep.


    And every modern forum editor (this one not included) word processors

    Yep. They're certainly not text editors along the lines of vim or emacs.


    My definition is, say:
    Text editor:
    Notepad, Notepad2, Notepad++, GEdit, SciTe, etc...
    VI, Emacs, Nano, etc, also count.
    Line Editor:
    Ed, Edlin, etc.
    Word Processor:
    Word, {Open/Libre}Office Writer, ...
    WordPad (sorta)
    ...

    The editors in a lot of email programs or forums are HTML or Markdown
    WYSIWYG editors being used as an editor, but I would not consider them
    as text-editors when used in this context.


    About as soon as one allows things like dynamic formatting, images, and
    other metadata that can't be expressed in bare ASCII or UTF-8 or
    similar, it is no longer a text editor as I see it.

    The fuzzy line here is mostly emojis, and other effects that can be
    shoehorned though UTF-8 or similar. Because, seemingly, the era of Plain
    ASCII has mostly passed (though, it seems uncommon to use characters
    outside of ASCII or 8859-1 / 1252 range all that often; apart from
    random people sticking emojis in stuff).


    Though, IIRC, if you try sticking emojis in a lot of text editors, they
    will often render in monochrome or in non-combined forms, rather than
    the full-color fully-graphical forms often expected in things like
    messaging or chat.

    So, for example, the "family" emoji might just render as the
    man/woman/child emojis, with implicit zero-width-joiners.


    Ironically, a set already exists in certain contexts in TestKern, mostly
    for the character ranges inherited from Unifont (which apparently mostly contains the original set of ~ 200 emojis developed by NTT DoCoMo and
    similar, which exist within the BMP).

    Well, and with "quality" based on the automated algorithmic conversion
    from 16x16 1bpp bitmap graphics to SDF (sorta hit/miss).

    A different (more customized) font is used for 1252-range, mostly
    because the Unifont graphics don't work well if scaled below 16x16, and
    my strategy for the "base" characters was to design things mostly around
    an 8x8 pixel cell.

    Though, for the GUI text console and similar, I ended up going for 5x6
    padded to 6x8, which doesn't really work much outside of ASCII (and
    generally a bitmap font is used for the 6x8 and 8x8 cases; falling back
    to trying to generate cells from the SDF if accessing characters outside
    the ASCII or 1252 set, with results that are generally unreadable).


    The smallest is 3x5 padded to 4x6, but this is barely passable for ASCII
    and one needs to use their imagination for some of the character glyphs
    (so I ended up going with 5x6/6x8 instead). I suspect that 3x5 is the
    smallest size possible for semi-recognizable ASCII text.

    But, one arguable merit to 3x5 is that it does allow fitting 80x25 text characters into 320x150 pixels, or 40x25 in 160x150 (roughly the same as
    the screen on the original GameBoy).

    --- Synchronet 3.20a-Linux NewsLink 1.114