[Python-announce] TatSu v5.20.0

    From Juancarlo Añez <[email protected]> to comp.lang.python.announce on Sat May 23 16:42:40 2026
    From Newsgroup: comp.lang.python.announce

    v5.20.0 Update
    ==============

    https://pypi.org/project/TatSu/5.20.0/

    Sibling Projects
    ----------------

    There are ports of TatSu to Go and Rust. They are functionally
    complete, except for features (like synthetic classes) that rely on
    the dynamic nature of Python.

    铁修 TieXiu
    ...........

    铁修 TieXiu is the port of TatSu to Rust. It features a PyO3 interface,
    so it’s also a Python library, but the benchmarks show that the
    pure-Python parsers generated by TatSu are still more performant when
    hosted from Python. See the TieXiu README for a discussion of the
    performance limits of PEG parsers.

    ⻰OGoPEGo
    ..........

    ⻰OGoPEGo is the port of TatSu to Go. The implementation, being the
    most mature, is beautifully concise, and using the generated parsers
    has a simplicity closer to the Python style allowed by TatSu.

    Internals
    ---------

    - The algorithm for left-recursion analysis went through another round
    of simplification and optimization. Then the analysis done in pegen, a
    more efficient and theoretically sound approach, was evaluated. All
    tests pass with pegen’s SCC (Strongly Connected Components)
    algorithm, so the old, tried-and-tested algorithm in TatSu was
    replaced.

    Although left-recursion analysis is performed once per Grammar, before
    any parsing, a simpler implementation makes this core part of TatSu
    easier to maintain.
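
    As a rough illustration of the SCC idea (this is neither TatSu's nor
    pegen's code), the sketch below runs Tarjan's algorithm over a
    hypothetical "leftmost call" graph, in which an edge A -> B means that
    rule B can appear in the leftmost position of rule A. Rules that sit
    in a component with more than one member, or that have an edge to
    themselves, are the left-recursive candidates. The graph contents and
    the names strongly_connected() and left_recursive_rules() are
    assumptions made for the example.

```python
from collections.abc import Iterator


def strongly_connected(graph: dict[str, set[str]]) -> Iterator[list[str]]:
    """Yield the strongly connected components of graph (Tarjan)."""
    index: dict[str, int] = {}
    lowlink: dict[str, int] = {}
    stack: list[str] = []
    on_stack: set[str] = set()

    def visit(node: str) -> Iterator[list[str]]:
        index[node] = lowlink[node] = len(index)
        stack.append(node)
        on_stack.add(node)
        for succ in sorted(graph.get(node, ())):
            if succ not in index:
                yield from visit(succ)
                lowlink[node] = min(lowlink[node], lowlink[succ])
            elif succ in on_stack:
                lowlink[node] = min(lowlink[node], index[succ])
        if lowlink[node] == index[node]:
            # node is the root of a component: pop its members.
            component = []
            while True:
                member = stack.pop()
                on_stack.discard(member)
                component.append(member)
                if member == node:
                    break
            yield component

    for node in sorted(graph):
        if node not in index:
            yield from visit(node)


def left_recursive_rules(first_calls: dict[str, set[str]]) -> set[str]:
    """Rules in a cycle of leftmost calls (including self-loops)."""
    result: set[str] = set()
    for component in strongly_connected(first_calls):
        if len(component) > 1:
            result.update(component)
        elif component[0] in first_calls.get(component[0], ()):
            result.add(component[0])
    return result


# Hypothetical graph: expr calls itself leftmost; a and b call each other.
first_calls = {
    "expr": {"expr", "term"},
    "term": {"factor"},
    "factor": set(),
    "a": {"b"},
    "b": {"a"},
}
print(sorted(left_recursive_rules(first_calls)))  # ['a', 'b', 'expr']
```

    A real analysis also has to account for nullable prefixes (a rule
    that can match the empty string before the leftmost call), which
    this sketch leaves out.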

    g2e (ANTLR to TatSu)
    --------------------

    - The g2e (ANTLR grammar to TatSu) translator has been revived and
    significantly simplified. A working example (python3.tatsu, 551 lines)
    is generated from Python 3’s full ANTLR grammar and passes
    tatsu.compile().

    - Removed the regex conversion approach for ANTLR token rules. ANTLR
    lexer patterns (notably \uXXXX escapes) are not viable as Python regex
    patterns. Non-trivial token rules now emit Fail() instead of Pattern.

    - g2e replaces simple token definitions (like
    OPEN_PAREN : '(' {opened++;};) with their right-hand side (just '(')
    for better-looking grammars. For complex token definitions ANTLR uses
    a special syntax that differs from that of Python-compatible (PCRE2)
    regular expressions, so g2e omits them, leaving it to the user to
    decide how to handle those tokens. In many cases a single pattern
    match is enough for the grammar of interest, and a semantic rule may
    be added to validate additional conditions that the parsed token
    should meet.

    - Streamlined generated grammar output — removed unnecessary
    parenthesization:

    - Single token references in alternatives no longer wrapped in extra
    parens: (NEWLINE) → NEWLINE.
    - Groups inside [...], {...}, {...}+ unwrapped: [('as' NAME)] →
    ['as' NAME], {('.' NAME)} → {'.' NAME}.
    - Rule deduplication by name handles tokens {} declarations that
    collide with defined rules (e.g. INDENT/DEDENT).

    - Token name resolution now uses uppercase names consistently.

    - The g2e example (examples/g2e) uses the old LL(1) Python grammar.
    Since Python switched to its PEG parser, the actual grammar is much
    simpler. The example is kept as it was to demonstrate g2e’s behavior
    over a complex grammar.

    Tools
    -----

    - A new --recursion-limit (-R1) option was added to the tatsu CLI tool
    so it can handle large and deeply recursive input grammars. When used
    as a library, the host program should call sys.setrecursionlimit()
    when required by the grammar complexity.
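
    For library use, one possible way to follow that advice is to raise
    the limit only around the deeply recursive call and restore it
    afterwards. The sketch below uses only the standard library; the
    recursion_limit() helper and the depth() function (a stand-in for a
    deeply recursive parse) are hypothetical names, and the limit values
    are arbitrary.

```python
import sys
from collections.abc import Iterator
from contextlib import contextmanager


@contextmanager
def recursion_limit(limit: int) -> Iterator[None]:
    """Temporarily raise the interpreter recursion limit (never lowers it)."""
    previous = sys.getrecursionlimit()
    sys.setrecursionlimit(max(previous, limit))
    try:
        yield
    finally:
        sys.setrecursionlimit(previous)


def depth(nested: list) -> int:
    """Recursive stand-in for a deeply recursive parse."""
    return 1 + depth(nested[0]) if nested else 0


# Build a list nested 3000 levels deep, more than the default limit allows.
nested: list = []
for _ in range(3000):
    nested = [nested]

before = sys.getrecursionlimit()
with recursion_limit(10_000):
    result = depth(nested)

print(result)  # 3000
```

    In a host program the body of the with block would hold the call
    that compiles or parses with the large grammar.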

    - Added better rendering to FailedParse.__str__(). Now a code fragment
    and line numbers are shown, as in many modern tools.

        error: expecting 'world'
         --> example:1:7
          |
        1 | hello missing
          |       ^ expecting 'world'

        -> start
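
    To make the layout of such a report concrete, here is a small
    stand-alone sketch that formats a message in the same style. It is
    not TatSu's implementation: render_failure() is a hypothetical
    helper, and it assumes that line and column numbers are 1-based, as
    the 1:7 in the sample suggests.

```python
def render_failure(source: str, name: str, line: int, col: int, message: str) -> str:
    """Format a diagnostic with a source fragment; line and col are 1-based."""
    text = source.splitlines()[line - 1]
    gutter = " " * len(str(line))  # keeps the bars aligned with the line number
    return "\n".join(
        [
            f"error: {message}",
            f"{gutter}--> {name}:{line}:{col}",
            f"{gutter} |",
            f"{line} | {text}",
            f"{gutter} | {' ' * (col - 1)}^ {message}",
        ]
    )


report = render_failure("hello missing", "example", 1, 7, "expecting 'world'")
print(report)
```

    The caret lands under the character at the reported column, here
    the first letter of the unexpected word.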

    JSON
    ----

    - tatsu.ebnf defines rules for JSON literals, so true, false, and null
    may be used where previously only True, False, and None were
    recognized. The Python literals are still honored as before, as is
    the boolean rule resolving to True for non-falsy values. These
    literals are only used in grammar directives, as parsing is only
    interested in the strings that match a Token or Pattern.

    - Now a Grammar can be imported from the JSON produced by
    model.asjson(). Roundtrip has been tested and it works. New methods
    Grammar.load(value: Any) -> Grammar and
    Grammar.loads(json: str) -> Grammar make the functionality available.

        class Grammar:
            @staticmethod
            def load(value: Any) -> Grammar:
                from .json import load_grammar

                return load_grammar(value)

            @staticmethod
            def loads(value: str) -> Grammar:
                from .json import loads_grammar

                return loads_grammar(value)
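
    A roundtrip through these methods might look like the sketch below.
    The roundtrip() helper is hypothetical. It assumes only what the
    notes above state: that the model offers asjson(), and that its
    class offers the loads() static method. It also assumes that the
    value returned by asjson() can be serialized with json.dumps().

```python
import json
from typing import Any


def roundtrip(model: Any) -> Any:
    """Rebuild a grammar model from its own JSON export.

    Assumes model.asjson() returns a JSON-compatible value and that
    type(model) provides the loads() static method quoted above.
    """
    value = model.asjson()    # JSON-compatible Python value
    text = json.dumps(value)  # serialized form, e.g. for saving to a file
    return type(model).loads(text)
```

    With TatSu 5.20 installed, the intended use would be something like
    roundtrip(tatsu.compile(grammar)). Whether the rebuilt model
    compares equal to the original is not stated in this announcement.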

    Grammar Syntax
    --------------

    - The definition of the DEDENT rule in the TatSu grammar is used to
    support EBNF notations with no rule terminators and grammars with no
    blank lines between rules. The pattern used in the rule was
    incorrectly consuming the first non-space character starting the next
    rule. Fixed.

    Now this is a valid EBNF definition:

    grammar = r"""
    @@grammar :: MiniJSON
    @@nameguard :: False
    @@whitespace :: /\s+/
    start: value $

    value: object | array | string | number | 'true' | 'false' | 'null'

    object: '{' members? '}'
    array: '[' elements? ']'
    members: pair (',' pair)*
    elements: value (',' value)*
    pair: string ':' value
    string: '"' CONTENT '"'
    CONTENT: /[^"]*/
    number: /-?\d+(\.\d+)?/
    """

    --
    Juancarlo Añez
    mailto:[email protected]