From Newsgroup: comp.lang.prolog
Hi,
Ok, I was looking at this learning challenge:
producing a vector (y1,y2,y3,y4) from a vector
(x1,x2,x3,x4). Can System R do it via least squares?
| 0 0 0 1 | | x1 | | x4 |
| 0 0 1 0 | | x2 | = | x3 |
| 0 1 0 0 | | x3 | | x2 |
| 1 0 0 0 | | x4 | | x1 |
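Here is a minimal sketch of that least-squares fit,
in Python/NumPy rather than R (same idea; the sample
size of 100 and the random seed are arbitrary choices):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))   # 100 random inputs (x1..x4)
Y = X[:, ::-1]                  # targets: reversed vectors (x4..x1)

# Solve min_W ||X @ W - Y||^2 by least squares; W recovers
# the anti-diagonal permutation matrix shown above
# (which is symmetric, so row/column convention doesn't matter).
W, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(np.round(W, 3))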
How it started:
"multiplicative RNNs arises naturally from a
proof-theoretic interpretation of next-token
prediction as nested intuitionistic implication"
Paul Tarau - 2026
https://arxiv.org/abs/2601.19915
How it's going:
"Dave uses a PDP-11 to train a real Neural
Network complete with Transformers and
Attention so you can see them at their most basic."
Mr. Taskmanager - 2026
https://www.youtube.com/watch?v=OUE3FSIk46g
We see Doctor Frankenstein in action from
the Bronze Age of Computing, producing
a Homunculus, the progenitor of today's
Bulgakov Shuriks in the Hyperscale Age!
Bye
P.S.: My impression is that neither cuts to the
core, namely that this incredible transformer
most likely produced this deterministic attention:
| -1 | * | k | + | 5 | = | k' |
Or expressed differently: y_k = x_{5-k}.
How did the transformer do it? It produced
a neural network with 1216 parameters, but
didn't use embeddings or a polar encoding
of positions. But if we strip the noise from
the position encoding and denoise it, where
the denoising is done via softmax, we must
somehow end up with the above, right? I still
need to verify my claim! (A toy sketch of the
idea follows below.) BTW: The PDP-11 assembly
from 1979 uses a wider example, not with n=4
but with n=8.
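P.P.S.: Here is the toy sketch (Python/NumPy, my own
unverified guess, not taken from the video): score each
key position by closeness to the target k' = 5 - k, and
a sharp softmax turns the scores into a near-deterministic
anti-diagonal attention pattern:

import numpy as np

n = 4
ks = np.arange(1, n + 1)              # query positions k = 1..4
targets = 5 - ks                      # the affine map k' = -1*k + 5

# Score key position j by closeness to the target k', then
# sharpen with a low-temperature softmax into near-one-hot rows.
scores = -np.abs(targets[:, None] - ks[None, :])
tau = 0.1                             # low temperature = sharp softmax
A = np.exp(scores / tau)
A = A / A.sum(axis=1, keepdims=True)
print(np.round(A, 3))                 # ~ the anti-diagonal matrix above

x = np.array([1.0, 2.0, 3.0, 4.0])
print(A @ x)                          # ~ (4, 3, 2, 1), i.e. y_k = x_{5-k}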
--- Synchronet 3.21f-Linux NewsLink 1.2