Forum: War Ensemble BBS

[Python-announce] TatSu v5.20.0

From Juancarlo Añez to comp.lang.python.announce on Sat May 23 16:42:40 2026

From Newsgroup: comp.lang.python.announce

v5.20.0 Update
==============

https://pypi.org/project/TatSu/5.20.0/

Sibling Projects
----------------

There are ports of TatSu to Go and Rust. They are functionally complete
except for features (like synthetic classes) that rely on the
dynamic nature of Python.

铁修 TieXiu
...........

铁修 TieXiu is the port of TatSu to Rust. It features a PyO3 interface,
so it’s also a Python library, but the benchmarks show that the
pure-Python parsers generated by TatSu are still more performant when
hosted from Python. See the TieXiu README for a discussion of the
performance limits of PEG parsers.

⻰OGoPEGo
..........

⻰OGoPEGo is the port of TatSu to Go. The implementation, being the most mature, is beautifully concise, and using the generated parsers has a simplicity closer to the Python style allowed by TatSu.

Internals
---------

- The algorithm for left-recursion analysis went through another round of
simplification and optimization. Then the analysis done in pegen, a
more efficient and theoretically sound approach, was evaluated. All
tests pass with pegen’s SCC (Strongly Connected Components)
algorithm, so the old, tried-and-tested algorithm in TatSu was replaced.

Although left-recursion analysis is performed once per Grammar, before
any parsing, a simpler implementation makes this core part of TatSu
easier to maintain.
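
As a rough illustration of the SCC idea (this is not TatSu's or pegen's
actual code), the sketch below runs Tarjan's algorithm over a
hypothetical `first_refs` mapping from each rule to the rules it may
invoke in leftmost position. Under that assumption, a rule is
left-recursive when it shares a component with another rule or refers
to itself directly. All names here are made up for the example.

```python
def strongly_connected_components(graph):
    """Tarjan's algorithm; returns a list of SCCs, each a set of node names."""
    index = {}       # discovery order of each visited node
    lowlink = {}     # smallest index reachable from the node's subtree
    stack = []
    on_stack = set()
    counter = 0
    result = []

    def visit(node):
        nonlocal counter
        index[node] = lowlink[node] = counter
        counter += 1
        stack.append(node)
        on_stack.add(node)
        for succ in graph.get(node, ()):
            if succ not in index:
                visit(succ)
                lowlink[node] = min(lowlink[node], lowlink[succ])
            elif succ in on_stack:
                lowlink[node] = min(lowlink[node], index[succ])
        if lowlink[node] == index[node]:
            # node is the root of a component: pop its members off the stack
            component = set()
            while True:
                member = stack.pop()
                on_stack.discard(member)
                component.add(member)
                if member == node:
                    break
            result.append(component)

    for node in graph:
        if node not in index:
            visit(node)
    return result


def left_recursive_rules(first_refs):
    """Rules that sit on a cycle of leftmost references."""
    recursive = set()
    for component in strongly_connected_components(first_refs):
        if len(component) > 1:
            recursive |= component          # indirect left recursion
        else:
            (rule,) = component
            if rule in first_refs.get(rule, ()):
                recursive.add(rule)         # direct left recursion
    return recursive


# Hypothetical leftmost-reference graph for a toy grammar.
FIRST_REFS = {
    "expr": ["expr", "term"],   # expr: expr '+' term | term
    "term": ["factor"],
    "factor": ["atom"],
    "atom": [],
    "a": ["b"],                 # a and b are mutually left-recursive
    "b": ["a"],
}

RECURSIVE = left_recursive_rules(FIRST_REFS)
```

Because the graph is visited once, the cost is linear in the number of
rules plus leftmost references, which fits an analysis that runs once
per grammar.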

g2e (ANTLR to TatSu)
--------------------

- The g2e (ANTLR grammar to TatSu) translator has been revived and
significantly simplified. A working example (python3.tatsu, 551 lines)
is generated from Python 3’s full ANTLR grammar and passes
tatsu.compile().

- Removed the regex conversion approach for ANTLR token rules. ANTLR
lexer patterns (notably \uXXXX escapes) are not viable as Python regex
patterns. Non-trivial token rules now emit Fail() instead of Pattern.

- g2e replaces simple token definitions (like
OPEN_PAREN : '(' {opened++;};) with their right-hand side (just '(')
to produce better-looking grammars. For complex token definitions ANTLR
uses a special syntax that differs from Python-compatible (PCRE2)
regular expressions, so g2e omits them, leaving it to the user to
decide how to handle those tokens. In many cases a single pattern
match is enough for the grammar of interest, and a semantic rule may
be added to validate additional conditions that the parsed token
should meet.

- Streamlined generated grammar output by removing unnecessary
parenthesization:

- Single token references in alternatives no longer wrapped in extra
parens: (NEWLINE) → NEWLINE.
- Groups inside [...], {...}, {...}+ unwrapped: [('as' NAME)] →
['as' NAME], {('.' NAME)} → {'.' NAME}.
- Rule deduplication by name handles tokens {} declarations that
collide with defined rules (e.g. INDENT/DEDENT).

- Token name resolution now uses uppercase names consistently.

- The g2e example (examples/g2e) uses the old, LL(1) Python grammar.
Since Python adopted a PEG parser, its actual grammar is much simpler.
The example is kept as it was to demonstrate g2e’s behavior over
a complex grammar.

Tools
-----

- A new --recursion-limit (-R1) option was added to the tatsu CLI tool
so it can handle large and deeply recursive input grammars. When used
as a library, the host program should call sys.setrecursionlimit()
when required by the grammar complexity.

- Added better rendering to FailedParse.__str__(). Now a code fragment
and line numbers are shown, as in many modern tools.

    error: expecting 'world'
     --> example:1:7
      |
    1 | hello missing
      |       ^ expecting 'world'

    -> start
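
For readers curious how this style of message can be produced, here is
a minimal sketch. It is not the code behind FailedParse.__str__(); the
function `render_error` and its parameters are hypothetical, and it
only covers the fragment-and-caret part of the output, not the rule
stack line.

```python
def render_error(source, filename, line, col, message):
    """Format an error with the offending source line and a caret.

    `line` and `col` are 1-based, as in the sample output above.
    """
    text = source.splitlines()[line - 1]
    gutter = " " * len(str(line))   # blank margin as wide as the line number
    return "\n".join([
        f"error: {message}",
        f"{gutter}--> {filename}:{line}:{col}",
        f"{gutter} |",
        f"{line} | {text}",
        f"{gutter} | {' ' * (col - 1)}^ {message}",
    ])


REPORT = render_error("hello missing", "example", 1, 7, "expecting 'world'")
```

The caret is placed by padding with `col - 1` spaces after the gutter,
so it lines up with the character at the reported column.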

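When TatSu is used as a library, the sys.setrecursionlimit() advice
above can be applied with a small helper. The following is a sketch
using only the standard library; the context manager
`recursion_limit` and the `depth` function are hypothetical names for
the example, and `depth` merely stands in for a deeply recursive parse.

```python
import sys
from contextlib import contextmanager


@contextmanager
def recursion_limit(limit):
    """Temporarily raise the interpreter's recursion limit (never lowers it)."""
    previous = sys.getrecursionlimit()
    sys.setrecursionlimit(max(previous, limit))
    try:
        yield
    finally:
        sys.setrecursionlimit(previous)


def depth(n):
    """Stand-in for deeply recursive work such as parsing nested input."""
    return 0 if n == 0 else 1 + depth(n - 1)


with recursion_limit(20_000):
    DEEP = depth(5_000)
```

In a host program, the `with` block would wrap the call that compiles
or parses the grammar, so the higher limit applies only where needed.
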
JSON
----

- tatsu.ebnf defines rules for JSON literals, so true, false, and null
may be used where previously only True, False, and None were
recognized. The Python literals are still honored as before, as is
the boolean rule resolving to True for non-falsy values. These
literals are only used in grammar directives, as parsing is only
interested in the strings that match a Token or Pattern.

- Now a Grammar can be imported from the JSON produced by
model.asjson(). Roundtrip has been tested and it works. New methods
Grammar.load(value: Any) -> Grammar and
Grammar.loads(json: str) -> Grammar make the functionality available.

    class Grammar:
        @staticmethod
        def load(value: Any) -> Grammar:
            from .json import load_grammar

            return load_grammar(value)

        @staticmethod
        def loads(value: str) -> Grammar:
            from .json import loads_grammar

            return loads_grammar(value)

Grammar Syntax
--------------

- The definition of the DEDENT rule in the TatSu grammar is used to
support EBNF notations with no rule terminators and grammars with no
blank lines between rules. The pattern used in the rule was incorrectly
consuming the first non-space character starting the next rule. Fixed.

Now this is a valid EBNF definition:

grammar = r"""
@@grammar :: MiniJSON
@@nameguard :: False
@@whitespace :: /\s+/
start: value $

value: object | array | string | number | 'true' | 'false' | 'null'

object: '{' members? '}'
array: '[' elements? ']'
members: pair (',' pair)*
elements: value (',' value)*
pair: string ':' value
string: '"' CONTENT '"'
CONTENT: /[^"]*/
number: /-?\d+(\.\d+)?/
"""
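
The kind of bug fixed in the DEDENT rule, a boundary pattern that eats
the first character of the next rule, can be illustrated with plain
`re`. The two patterns below are hypothetical and are not the ones used
in the TatSu grammar; they only contrast a consuming match with a
lookahead.

```python
import re

TEXT = "start: value\nvalue: 'a'\n"

# Consuming version: \S is part of the match, so the 'v' of "value" is eaten.
CONSUMING = re.compile(r"\n\S")

# Lookahead version: checks that a non-space character follows
# without making it part of the match.
LOOKAHEAD = re.compile(r"\n(?=\S)")


def rest_after_boundary(pattern, text):
    """Return the text remaining after the first rule boundary."""
    match = pattern.search(text)
    return text[match.end():]


CONSUMED = rest_after_boundary(CONSUMING, TEXT)
PRESERVED = rest_after_boundary(LOOKAHEAD, TEXT)
```

With the consuming pattern the next rule would start at "alue", while
the lookahead leaves "value" intact for the rule that follows.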

---
--
Juancarlo Añez
mailto:[email protected]