
pegen's Introduction



What is this?

Pegen is the parser generator used in CPython to produce the parser used by the interpreter. It can produce PEG parsers from a description of a formal grammar.

Installing

Install with pip or your favorite PyPI package manager.

pip install pegen

Documentation

The documentation is available at https://we-like-parsers.github.io/pegen/.

How to generate a parser

Given a grammar file compatible with pegen (you can write your own or start with one in the data directory), you can easily generate a parser by running:

python -m pegen <path-to-grammar-file> -o parser.py

This will generate a file called parser.py in the current directory, which can be used to parse code written in the language the grammar describes:

python parser.py <file-with-code-to-parse>
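If you want to drive the generated parser from Python instead of the command line, a minimal sketch looks like the following (this assumes the default generated class name GeneratedParser and a grammar whose start rule is called start; adjust both to your grammar):

import io
import tokenize

from pegen.tokenizer import Tokenizer
from parser import GeneratedParser  # the parser.py generated above

source = "1 + 2\n"
tok_stream = tokenize.generate_tokens(io.StringIO(source).readline)
parser = GeneratedParser(Tokenizer(tok_stream))
tree = parser.start()  # every grammar rule becomes a method on the parser
print(tree)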

As a demo, you can generate a Python parser from data/python.gram and use it to parse and run tests/demo.py:

make demo

How to contribute

See the instructions in the CONTRIBUTING.md file.

Differences with CPython's Pegen

This repository exists to distribute a version of the Python PEG parser generator used by CPython that can be installed via PyPI, with some improvements. Although the official PEG generator included in CPython can generate both Python and C code, this distribution of the generator can only generate Python code. This is because the C code produced by the generator included in CPython relies on many implementation details and private headers that are not available for general use.

The official PEG generator for Python 3.9 and later is now included in the CPython repo under Tools/peg_generator/. We aim to keep this repo in sync with that version of the Python generator.

See also PEP 617.

Repository structure

  • The src directory contains the pegen source (the package itself).
  • The tests directory contains the test suite for the pegen package.
  • The data directory contains some example grammars compatible with pegen. This includes a pure-Python version of the Python grammar.
  • The docs directory contains the documentation for the package.
  • The scripts directory contains some useful scripts for visualizing grammars, benchmarking, and other tasks relevant to the development of the generator itself.
  • The stories directory contains the backing files and examples for Guido's blog series on PEG parsers.

Quick syntax overview

The grammar consists of a sequence of rules of the form:

    rule_name: expression

Optionally, a type can be included right after the rule name, which specifies the return type of the Python function corresponding to the rule:

    rule_name[return_type]: expression

If the return type is omitted, then Any is returned.

Grammar Expressions

# comment

Python-style comments.

e1 e2

Match e1, then match e2.

    rule_name: first_rule second_rule

e1 | e2

Match e1 or e2.

The first alternative can also appear on the line after the rule name for formatting purposes. In that case, a | must be used before the first alternative, like so:

    rule_name[return_type]:
        | first_alt
        | second_alt

( e )

Match e.

    rule_name: (e)

A slightly more complex and useful example includes using the grouping operator together with the repeat operators:

    rule_name: (e1 e2)*

[ e ] or e?

Optionally match e.

    rule_name: [e]

A more useful example includes defining that a trailing comma is optional:

    rule_name: e (',' e)* [',']

e*

Match zero or more occurrences of e.

    rule_name: (e1 e2)*

e+

Match one or more occurrences of e.

    rule_name: (e1 e2)+

s.e+

Match one or more occurrences of e, separated by s. The generated parse tree does not include the separator. This is otherwise identical to (e (s e)*).

    rule_name: ','.e+

&e

Succeed if e can be parsed, without consuming any input.

!e

Fail if e can be parsed, without consuming any input.

An example taken from the Python grammar specifies that a primary consists of an atom, which is not followed by a . or a ( or a [:

    primary: atom !'.' !'(' !'['

~

Commit to the current alternative, even if it fails to parse.

    rule_name: '(' ~ some_rule ')' | some_alt

In this example, if a left parenthesis is parsed, then the other alternative won’t be considered, even if some_rule or ‘)’ fail to be parsed.

Left recursion

PEG parsers normally do not support left recursion, but Pegen implements a technique that allows left recursion using the memoization cache. This allows us to write not only simple left-recursive rules but also more complicated rules that involve indirect left recursion like:

  rule1: rule2 | 'a'
  rule2: rule3 | 'b'
  rule3: rule1 | 'c'

and "hidden left-recursion" like:

  rule: 'optional'? rule '@' some_other_rule
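Simple, direct left recursion works as well, so for example a classic expression rule can be written naturally:

  expr: expr '+' term | term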

Variables in the Grammar

A sub-expression can be named by preceding it with an identifier and an = sign. The name can then be used in the action (see below), like this:

    rule_name[return_type]: '(' a=some_other_rule ')' { a }

Grammar actions

To avoid the intermediate steps that obscure the relationship between the grammar and the AST generation, the PEG parser allows directly generating AST nodes for a rule via grammar actions. Grammar actions are language-specific expressions that are evaluated when a grammar rule is successfully parsed. These expressions can be written in Python. As an example of a grammar with Python actions, the piece of the parser generator that parses grammar files is bootstrapped from a meta-grammar file with Python actions that generate the grammar tree as a result of the parsing.

In the specific case of the PEG grammar for Python, having actions allows directly describing how the AST is composed in the grammar itself, making it clearer and more maintainable. This AST generation process is supported by the use of some helper functions that factor out common AST object manipulations and some other required operations that are not directly related to the grammar.

To indicate these actions, each alternative can be followed by the action code inside curly braces, which specifies the return value of the alternative:

    rule_name[return_type]:
        | first_alt1 first_alt2 { first_alt1 }
        | second_alt1 second_alt2 { second_alt1 }

If the action is omitted, a default action is generated:

  • If there's a single name in the rule, it gets returned.

  • If there is more than one name in the rule, a collection with all parsed expressions gets returned.

This default behaviour is intended primarily for very simple situations and for debugging purposes.
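For example, under these defaults a rule such as single_name below returns just the parsed NAME token, while pair returns a collection holding both parsed items (a hypothetical sketch; the exact container may vary between pegen versions):

    single_name: NAME            # default action returns the NAME token
    pair: NAME NUMBER            # default action returns a collection of both items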

As an illustrative example, this simple grammar file allows directly generating a full parser that can parse simple arithmetic expressions and that returns a valid Python AST:

    start[ast.Module]: a=expr_stmt* ENDMARKER { ast.Module(body=a or []) }
    expr_stmt: a=expr NEWLINE { ast.Expr(value=a, EXTRA) }

    expr:
        | l=expr '+' r=term { ast.BinOp(left=l, op=ast.Add(), right=r, EXTRA) }
        | l=expr '-' r=term { ast.BinOp(left=l, op=ast.Sub(), right=r, EXTRA) }
        | term

    term:
        | l=term '*' r=factor { ast.BinOp(left=l, op=ast.Mult(), right=r, EXTRA) }
        | l=term '/' r=factor { ast.BinOp(left=l, op=ast.Div(), right=r, EXTRA) }
        | factor

    factor:
        | '(' e=expr ')' { e }
        | atom

    atom:
        | NAME
        | NUMBER

pegen's People

Contributors

0dminnimda, avdn, dfremont, edemaine, eric-vin, funkyfuture, isidentical, lucach, lysnikolaou, matthieudartiailh, mw66, pablogsal, phorward, shumbo, yf-yang


pegen's Issues

python.gram: Incorrect column numbers in SyntaxError

Python's tokenize.TokenInfo structure has 0-indexed column numbers, but SyntaxError wants 1-indexed column numbers. See this discussion.

This code from python.gram is thus incorrect:

pegen/data/python.gram

Lines 306 to 309 in d130038

self._exception = SyntaxError(
    message,
    (self.filename, start[0], start[1], line)
)

On the other hand, this code from parser.py is correct:

return SyntaxError(message, (filename, tok.start[0], 1 + tok.start[1], tok.line))

Could be fixed as part of #41. @MatthieuDartiailh let me know what you prefer; also happy to help add 1+ in the two necessary spots.
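A sketch of the 1-indexed fix for the python.gram snippet above (the actual patch may differ):

self._exception = SyntaxError(
    message,
    (self.filename, start[0], 1 + start[1], line)
)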

[Feature] Optionally create `.pxd` files

If the creation of a parser using C was ruled out because the original CPython pegen uses too much internal functionality, then we can at least make the Python parser faster with Cython.

Since pegen knows all the variables and types that users provide, it can use this information to create a .pxd file. This file will not interfere with the pure-Python parser, but it can be used by Cython to create a faster compiled version of the parser.

I will implement this functionality anyway, but I wonder whether I should turn it into a PR and try to merge it here. Is this feature needed here?

Advice request - general purpose parser generator?

Hi, in your series you've mentioned that:

The result may not be a great general-purpose PEG parser generator — there are already many of those (e.g. TatSu is written in Python and generates Python code)

After nine episodes and several years of developing pegen, if you were to define a grammar and generate a parser for a completely different language, would you do it using Pegen or stick to a third-party solution like TatSu?

Identifying unexpected token error positions

I hope it's ok to ask a general question about pegen here. I've built a parser following the blog posts and code here, and it generally works really nicely, but one thing I found missing in the blog is how to handle unexpected tokens. As far as I understand, the recursive descent parser will continue to backtrack on unexpected input until it reaches the first rule again, unless you define an explicit rule to handle particular errors. I assume there is some strategy to identify the token that caused the error, like how Python's parser knows the error in the following line is the *:

>>> 1*
  File "<stdin>", line 1
    1*
     ^
SyntaxError: invalid syntax

How/where is pegen handling this sort of error? Or, if pegen doesn't handle this error, where is Python's parser handling it (since I know it does!)?

Aside: while trying a different invalid Python syntax example I got something unexpected:

$ echo "hi(" -n | python -

With 3.9.2 this gives the error

  File "<stdin>", line 2
    
    ^
SyntaxError: unexpected EOF while parsing

which seems to be not showing the line with the error (but it's marking it). The same behaviour happens when hi( is in a file. Is this a bug with Python's new parser's error handling (and if so, should I report it)?

Remove hard dependencies

Currently we have hard dependencies on psutil, flask, and flask-wtf, which seems like complete overkill since one can perfectly well use pegen without any of them.

Consistent keyword ordering

Every time I run pegen on the same grammar, I (often) get a different order for KEYWORDS and SOFT_KEYWORDS. This is annoying for version control, when parse.py is committed to a repo.

I imagine the cause is that the keywords are stored in a set which has nondeterministic order, and sorted will make this behavior deterministic. Specifically, tuple(...) should be replaced with tuple(sorted(...)) in these lines:

self.print(f"KEYWORDS = {tuple(self.callmakervisitor.keywords)}")
self.print(f"SOFT_KEYWORDS = {tuple(self.callmakervisitor.soft_keywords)}")

Is it appropriate to make PR here, or would it be better against cpython?
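For reference, a sketch of the suggested change:

self.print(f"KEYWORDS = {tuple(sorted(self.callmakervisitor.keywords))}")
self.print(f"SOFT_KEYWORDS = {tuple(sorted(self.callmakervisitor.soft_keywords))}")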

data/python_parser.py out of sync with data/python.gram

data/python.gram was updated 5 days ago, but data/python_parser.py was not.

Since data/python_parser.py is generated, maybe we need a git hook to auto-gen data/python_parser.py whenever data/python.gram is updated.

BTW, the first release (0.1.0, Sep 6, 2021) was half a year ago; are we planning to do a new release?

`data/python.gram` produces wrong results when parsing fstring with `=` in `{}`

python -m pegen data/python.gram -o python_parser.py

import ast
import python_parser as pp

print(ast.dump(pp.parse_string('f"{x=}"', 'eval')))
print(ast.dump(ast.parse('f"{x=}"', mode='eval')))

outputs this:

Expression(body=JoinedStr(values=[FormattedValue(value=Name(id='x', ctx=Load()), conversion=114)]))
Expression(body=JoinedStr(values=[Constant(value='x='), FormattedValue(value=Name(id='x', ctx=Load()), conversion=114)]))

We can see that "=" is not considered in the python.gram.

parse error

start[ast.Module]: a=expr_stmt* ENDMARKER { ast.Module(body=a or [] }
expr_stmt: a=expr NEWLINE { ast.Expr(value=a, EXTRA) }

expr:
    | l=expr '+' r=term { ast.BinOp(left=l, op=ast.Add(), right=r, EXTRA) }
    | l=expr '-' r=term { ast.BinOp(left=l, op=ast.Sub(), right=r, EXTRA) }
    | term

term:
    | l=term '*' r=factor { ast.BinOp(left=l, op=ast.Mult(), right=r, EXTRA) }
    | l=term '/' r=factor { ast.BinOp(left=l, op=ast.Div(), right=r, EXTRA) }
    | factor

factor:
    | '(' e=expr ')' { e }
    | atom

atom:
    | NAME
    | NUMBER

This code is copied from https://we-like-parsers.github.io/pegen/grammar.html
and I used Python 3.10 to parse, but it failed. Obviously, "ast.Module(body=a or []" here lacks a ')', and there are other issues such as:

  File "<unknown>", line 1
    ast . Expr ( value = a , EXTRA )
                                   ^
SyntaxError: positional argument follows keyword argument     
For full traceback, use -v

Tokenizer returns empty string for the last line

Hi 👋 I'm using pegen for parsing a Python-like language, encountered something unexpected, and wondering if anyone can take a look.

Problem

It seems that pegen.tokenizer.Tokenizer stores the empty string for the last line of the source if the source does not end with a NEWLINE.

import io
import tokenize

from pegen.tokenizer import Tokenizer

source = "line 1\nline 2\nline 3"

print(source)

tok_stream = tokenize.generate_tokens(io.StringIO(source).readline)
tokenizer = Tokenizer(tok_stream)

# load all tokens
while True:
    tok = tokenizer.getnext()
    print(tok)
    if tok.type == tokenize.ENDMARKER:
        break

print(tokenizer._lines)

print("last line: ", tokenizer.get_lines([3]))

I expect the last line to print ["line 3"] but it prints [''].

After some investigation, I found that the empty line is coming from NEWLINE tokens generated by Python's tokenize module.

If the source does not end with a NEWLINE, tokenize injects one. For example, tokenizing the above input would produce

TokenInfo(type=1 (NAME), string='line', start=(1, 0), end=(1, 4), line='line 1\n')
TokenInfo(type=2 (NUMBER), string='1', start=(1, 5), end=(1, 6), line='line 1\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(1, 6), end=(1, 7), line='line 1\n')
TokenInfo(type=1 (NAME), string='line', start=(2, 0), end=(2, 4), line='line 2\n')
TokenInfo(type=2 (NUMBER), string='2', start=(2, 5), end=(2, 6), line='line 2\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(2, 6), end=(2, 7), line='line 2\n')
TokenInfo(type=1 (NAME), string='line', start=(3, 0), end=(3, 4), line='line 3')
TokenInfo(type=2 (NUMBER), string='3', start=(3, 5), end=(3, 6), line='line 3')
TokenInfo(type=4 (NEWLINE), string='', start=(3, 6), end=(3, 7), line='') 👈
TokenInfo(type=0 (ENDMARKER), string='', start=(4, 0), end=(4, 0), line='')

The second-to-last token does not exist in the original source but was added by the tokenizer. The problem is that this injected token does not have a line (which on its own is reasonable).

self._lines[tok.start[0]] = tok.line

Pegen's tokenizer uses this to cache the text on that line. So after execution, tokenizer._lines is set to {1: 'line 1\n', 2: 'line 2\n', 3: '', 4: ''}.

Possible Fixes

If this is an unintentional behavior, a possible fix I can think of is to update _lines only if the same line has not been set.
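A sketch of that idea (the real patch may look different):

# Only cache the first text seen for a given line number, so the
# synthesized NEWLINE/ENDMARKER tokens (whose line is '') cannot
# overwrite real source lines.
if tok.start[0] not in self._lines:
    self._lines[tok.start[0]] = tok.line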

I've added a test case, implemented this change, and it seems to be working.

https://github.com/we-like-parsers/pegen/compare/main...shumbo:pegen:fix-last-line?expand=1

Is there anything else to consider, especially around the condition of the if statement? I can create a PR if this may benefit someone else.

distribution doesn't seem to include `py.typed`

After a simple pip install pegen, the directory site-packages/pegen under my venv doesn't include py.typed and consequently mypy fails to typecheck my code which imports stuff from pegen. I'd submit a PR but I expect you'll want to include this file via pyproject.toml rather than setup.py, and honestly I'm not sure how to do that.

For the moment, I just did touch .venv/lib/python3.11/site-packages/pegen/py.typed to nudge things along, but it would be great if this wasn't necessary. Thanks!

Update README.md file

The README.md file still contains the information from the old repo. This needs to be updated.

Make a 2023 release

The latest, and only, release of pegen was v0.1.0 in 2021. There have been many commits and bug fixes since then. Are there any plans to create a new release that will be available on pypi.org soon?

Providing a different tokenizer

I note the docs say:

Tokens are restricted to the ones available in the tokenize module of the Python interpreter that is used to generate the parser. This means that tokenization of any parser generated by pegen must be a subset of the tokenization that Python itself uses.

Is there a reason this is a hard requirement? I'd like to use my own tokenizer because I want to match non-Python syntax (for example, SI-prefixed quantities such as 1.23k, which doesn't get matched by any of the built-in tokenize module tokens as far as I am aware). I notice in pegen.parser.Parser the productions for Python types are hard-coded to check for collisions with keywords etc. but perhaps the parser generator could also allow the developer to handle this themselves if they wish to use their own tokenizer.

Rule inlining

On an example grammar like this,

@subheader """\
import ast
import itertools

def parse_number(token):
    for converter in [int, float, complex]:
        try:
            return converter(token.string)
        except ValueError:
            continue
    raise ValueError(f"Can't convert: {{token.string!r}}")

def parse_string(tokens):
    return ''.join(token.string for token in itertools.chain.from_iterable(tokens))
"""

start[ast.mod]: a=expression NEWLINE* ENDMARKER { ast.dump(ast.Expression(a)) }

expression[ast.expr] (memo):
    | primary

primary[ast.expr]:
    | atom

atom[ast.expr]:
    | 'True' { ast.Constant(True) }
    | 'False' { ast.Constant(False) }
    | 'None' { ast.Constant(None) }
    | '...' { ast.Constant(Ellipsis) }
    | t_name
    | t_number
    | &STRING strings

t_name[ast.Name]: NAME { ast.Name(name.string, ast.Load()) }
t_number[ast.Constant]: NUMBER { ast.Constant(parse_number(number)) }

strings[ast.Constant] (memo): strings=STRING+ { ast.Constant(parse_string(strings)) }

When we have different rule groups (like primary, expression, etc.), the results get piled up. So instead of seeing Expression(body=Constant(value=1.0)) as the output, we see Expression(body=[[[Constant(value=1.0)]]]). This can be solved by adding actions that return the sub-rule's result, like:

expression[ast.expr] (memo):
    | primary { primary }

primary[ast.expr]:
    | atom { atom }

atom[ast.expr]:
    | 'True' { ast.Constant(True) }
    | 'False' { ast.Constant(False) }
    | 'None' { ast.Constant(None) }
    | '...' { ast.Constant(Ellipsis) }
    | t_name { t_name }
    | t_number { t_number }
    | &STRING strings { strings }

or by having an operator that does this under the hood. An example is the inline operator in lark-parser. This is how it might look:

start[ast.mod]: a=expression NEWLINE* ENDMARKER { ast.dump(ast.Expression(a)) }

?expression[ast.expr] (memo):
    | primary

?primary[ast.expr]:
    | atom

?atom[ast.expr]:
    | 'True' { ast.Constant(True) }
    | 'False' { ast.Constant(False) }
    | 'None' { ast.Constant(None) }
    | '...' { ast.Constant(Ellipsis) }
    | t_name
    | t_number
    | &STRING strings

?t_name[ast.Name]: NAME { ast.Name(name.string, ast.Load()) }
?t_number[ast.Constant]: NUMBER { ast.Constant(parse_number(number)) }

?strings[ast.Constant] (memo): strings=STRING+ { ast.Constant(parse_string(strings)) }

Making extensible parsers

A bit of context: while working on enaml (https://github.com/nucleic/enaml), which over the time I have maintained it has supported Python 2, 3.4, 3.5, 3.6, 3.7, 3.8, and 3.9 and strives to be able to parse any valid Python, I often had to marginally modify the parser to either support new syntax or change how AST nodes are created. Since I was using ply, I used to subclass the parser to add/modify rules or override methods to handle AST node changes.

Using pegen means a new parser would need to be generated for each supported Python version. The grammar files would obviously share a lot of code, and I am wondering if there are ways to limit duplication. If only the AST node creation changes, one can probably alter the subheader to use a different base class for the parser and call a method in the affected rules, but this does not scale to changes in the grammar proper.

Bug with zero or more rule

Gram file

start: stmts

stmts: stmt*

stmt: block

block: "{" stmts "}"

Input

{ {} }

Result

  File "<unknown>", line 1
    { {} }
       ^
SyntaxError: input.txt

The problem is in the generated parse rule for the block stmt.
Code

    @memoize
    def block(self) -> Optional[Any]:
        # block: "{" stmts "}"
        mark = self._mark()
        if (
            (literal := self.expect("{"))
            and
            (stmts := self.stmts())
            and
            (literal_1 := self.expect("}"))
        ):
            return [literal, stmts, literal_1];
        self._reset(mark)
        return None;

(stmts := self.stmts()) should be (stmts := self.stmts()) is not None, because an empty list is interpreted as false.
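A sketch of the corrected condition (only the stmts check needs the change, since the two expect() calls return tokens, never empty lists):

        if (
            (literal := self.expect("{"))
            and
            (stmts := self.stmts()) is not None
            and
            (literal_1 := self.expect("}"))
        ):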

Strange syntax error only with zero '0' converted to int

With this extremely simple grammar, "1" succeeds but "0" fails. Can anyone shed light on this issue? Thanks.

zero-fails.py

from prettyprinter import pprint as pp

from pegen.grammar_parser import GeneratedParser as GrammarParser
from pegen.utils import generate_parser, parse_string

grammar_spec = """
    start  : a=num

    num    : a=NUMBER { float(a.string) if '.' in a.string else int(a.string) }
"""

numbers = [ 1, 0 ]

for n in numbers :
    grammar = parse_string(grammar_spec, GrammarParser)
    # print() ; pp(grammar)
    grammar_parser_class = generate_parser(grammar)
    tree = parse_string(str(n), grammar_parser_class, verbose=True)
    verdict = 'PASSED' if tree is not None else 'FAILED'
    pp([ verdict, tree ]) ; print()

Output

start() ... (looking at 1.0: NUMBER:'1')
  num() ... (looking at 1.0: NUMBER:'1')
    number() ... (looking at 1.0: NUMBER:'1')
    ... number() -> TokenInfo(type=2 (NUMBER), string='1', start=(1, 0), end=(1, 1), line='1')
  ... num() -> 1
... start() -> 1
['PASSED', 1]

start() ... (looking at 1.0: NUMBER:'0')
  num() ... (looking at 1.0: NUMBER:'0')
    number() ... (looking at 1.0: NUMBER:'0')
    ... number() -> TokenInfo(type=2 (NUMBER), string='0', start=(1, 0), end=(1, 1), line='0')
  ... num() -> 0
... start() -> None
Traceback (most recent call last):
  File "/home/phdyex/src/python/grammar-tool/test-files/pegen/gunit/simple/issue/zero-syntax-error/./zero-fails.py", line 20, in <module>
    tree = parse_string(str(n), grammar_parser_class, verbose=True)
  File "/home/phdyex/.cache/pypoetry/virtualenvs/grammar-tool-srSGyud3-py3.10/lib/python3.10/site-packages/pegen/utils.py", line 66, in parse_string
    return run_parser(file, parser_class, verbose=verbose)  # type: ignore # typeshed issue #3515
  File "/home/phdyex/.cache/pypoetry/virtualenvs/grammar-tool-srSGyud3-py3.10/lib/python3.10/site-packages/pegen/utils.py", line 55, in run_parser
    raise parser.make_syntax_error("invalid syntax")
  File "<unknown>", line 1
    0
    ^
SyntaxError: invalid syntax
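This looks like the same falsy-result problem described in the issue above: the generated code checks the action result for truthiness, so a rule whose action returns the integer 0 is treated as a failed match. As a hedged workaround (assuming the action can be changed), one can return a value that is never falsy, for example by wrapping it in an AST node:

    num    : a=NUMBER { ast.Constant(float(a.string) if '.' in a.string else int(a.string)) }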

Defining things like what an identifier is

I can't seem to find any documentation on this, so I thought I'd try here.

In many grammar specifications, rules like NAME or NUMBER are used. I can see these defined in the file Tokens, but how do I define these? Is it safe to do:

identifier: characters_for_an_identifier

Or are there better ways of doing this? Different languages define what an "identifier" is differently, so I'm curious how this is handled, and where these rules/tokens are (actually) defined.

Any plans to make a more generic version?

I'm trying to generate a parser for a language that includes attribute names that contain -s and hex colors that begin with #.

The first one could be solved by reconstructing the name from NAME ('-' NAME)*, but that would also accept a - b as a name, which it is not.

The second one forces us to use a second parser to separate the hex portion from, for instance, any trailing comments (#abcdef // foo).

Document repo contents

Thanks to @pablogsal, I just learned about the data directory which has some cool example grammars, including a Python port of python.gram — impressive! It would have saved me a bunch of time for the README to link to the data directory, and for the data directory to have a README explaining what the various files are.

I could write a minimal PR but I don't know what most of the directories in this repo are (beyond the obvious src/pegen).

On the point of repo contents, I find the relation between this repo, the cpython edition, and the PyPI release confusing. Perhaps it would help to clarify that the intent is to keep this repo synchronized with the cpython edition (at least I get that impression from history), but there may be lag? Also, #1 suggests the latter doesn't exist, and in my limited testing, the PyPI edition seemed out-of-date; I could be wrong on that.

Source for the Python parser

pegen/data contains a Python parser supposedly generated from data/python.gram, but this file only contains C-like rules? Would it be possible to get the file containing the Python rules?

Test failure with Python 3.11

The tests are passing with Python 3.10 but not with Python 3.11 for me.

============================= test session starts ==============================
platform linux -- Python 3.11.3, pytest-7.2.1, pluggy-1.0.0
rootdir: /build/source, configfile: pyproject.toml
collected 371 items                                                            

stories/story1/test_parser.py ..                                         [  0%]
[...]
........................................................................ [ 86%]
.....................FF..........                                        [ 95%]
tests/python_parser/test_unsupported_syntax.py .................         [100%]

=================================== FAILURES ===================================
______ test_invalid_def_stmt[def f:-SyntaxError-expected '('-start4-end4] ______

python_parse_file = <function parse_file at 0x7ffff62b7380>
python_parse_str = <function parse_string at 0x7ffff5078e00>
tmp_path = PosixPath('/build/pytest-of-nixbld/pytest-0/test_invalid_def_stmt_def_f__S0')
source = 'def f:', exception = <class 'SyntaxError'>, message = "expected '('"
start = (1, 6), end = (1, 6)

    @pytest.mark.parametrize(
        "source, exception, message, start, end",
        [
            (
                "def f():\npass",
                IndentationError,
                "expected an indented block after function definition on line 1",
                (2, 1),
                (2, 5),
            ),
            (
                "async def f():\npass",
                IndentationError,
                "expected an indented block after function definition on line 1",
                (2, 1),
                (2, 5),
            ),
            (
                "def f(a,):\npass",
                IndentationError,
                "expected an indented block after function definition on line 1",
                (2, 1),
                (2, 5),
            ),
            (
                "def f() -> None:\npass",
                IndentationError,
                "expected an indented block after function definition on line 1",
                (2, 1),
                (2, 5),
            ),
            ("def f:", SyntaxError, "expected '('", (1, 6), (1, 6)),
            ("async def f:", SyntaxError, "expected '('", (1, 12), (1, 12)),
            # (
            #     "def f():\n# type: () -> int\n# type: () -> str\n\tpass",
            #     SyntaxError,
            #     "expected an indented block after function definition on line 1",
            # ),
        ],
    )
    def test_invalid_def_stmt(
        python_parse_file, python_parse_str, tmp_path, source, exception, message, start, end
    ):
>       parse_invalid_syntax(
            python_parse_file,
            python_parse_str,
            tmp_path,
            source,
            exception,
            message,
            start,
            end,
            (3, 11) if exception is SyntaxError else (3, 10),
        )

tests/python_parser/test_syntax_error_handling.py:1228: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

python_parse_file = <function parse_file at 0x7ffff62b7380>
python_parse_str = <function parse_string at 0x7ffff5078e00>
tmp_path = PosixPath('/build/pytest-of-nixbld/pytest-0/test_invalid_def_stmt_def_f__S0')
source = 'def f:', exc_cls = <class 'SyntaxError'>, message = "expected '('"
start = (1, 6), end = (1, 6), min_python_version = (3, 11)

    def parse_invalid_syntax(
        python_parse_file,
        python_parse_str,
        tmp_path,
        source,
        exc_cls,
        message,
        start,
        end,
        min_python_version=(3, 10),
    ):
    
        # Check we obtain the expected error from Python
        try:
            exec(source, {}, {})
        except exc_cls as py_e:
            py_exc = py_e
        except Exception as py_e:
            assert (
                False
            ), f"Python produced {py_e.__class__.__name__} instead of {exc_cls.__name__}: {py_e}"
        else:
            assert False, f"Python did not throw any exception, expected {exc_cls}"
    
        # Check our parser raises both from str and file mode.
        with pytest.raises(exc_cls) as e:
            python_parse_str(source, "exec")
    
        print(str(e.exconly()))
        assert message in str(e.exconly())
    
        test_file = tmp_path / "test.py"
        with open(test_file, "w") as f:
            f.write(source)
    
        with pytest.raises(exc_cls) as e:
            python_parse_file(str(test_file))
    
        # Check Python message but do not expect message to match for earlier Python versions
        if sys.version_info >= min_python_version:
            # This fails for Python < 3.10.5 but keeping the fix for a patch version is not
            # worth it
            assert message in py_exc.args[0]
    
        print(str(e.exconly()))
        assert message in str(e.exconly())
    
        # Check start/end line/column on Python 3.10
        for parser, exc in ([("Python", py_exc)] if sys.version_info >= min_python_version else []) + [
            ("pegen", e.value)
        ]:
            if (
                exc.lineno != start[0]
                or exc.offset != start[1]
                # Do not check end for indentation errors
                or (
                    sys.version_info >= (3, 10)
                    and not isinstance(e, IndentationError)
                    and exc.end_lineno != end[0]
                )
                or (
                    sys.version_info >= (3, 10)
                    and not isinstance(e, IndentationError)
                    and (end[1] is not None and exc.end_offset != end[1])
                )
            ):
                if sys.version_info >= (3, 10):
>                   raise ValueError(
                        f"Expected locations of {start} and {end}, but got "
                        f"{(exc.lineno, exc.offset)} and {(exc.end_lineno, exc.end_offset)} "
                        f"from {parser}"
                    )
E                   ValueError: Expected locations of (1, 6) and (1, 6), but got (1, 6) and (1, 7) from Python

tests/python_parser/test_syntax_error_handling.py:74: ValueError
----------------------------- Captured stdout call -----------------------------
  File "<unknown>", line 1
    def f:
         ^
SyntaxError: expected '('
  File "test.py", line 1
    def f:
         ^
SyntaxError: expected '('
___ test_invalid_def_stmt[async def f:-SyntaxError-expected '('-start5-end5] ___

python_parse_file = <function parse_file at 0x7ffff62b7380>
python_parse_str = <function parse_string at 0x7ffff5078e00>
tmp_path = PosixPath('/build/pytest-of-nixbld/pytest-0/test_invalid_def_stmt_async_de1')
source = 'async def f:', exception = <class 'SyntaxError'>
message = "expected '('", start = (1, 12), end = (1, 12)

    @pytest.mark.parametrize(
        "source, exception, message, start, end",
        [
            (
                "def f():\npass",
                IndentationError,
                "expected an indented block after function definition on line 1",
                (2, 1),
                (2, 5),
            ),
            (
                "async def f():\npass",
                IndentationError,
                "expected an indented block after function definition on line 1",
                (2, 1),
                (2, 5),
            ),
            (
                "def f(a,):\npass",
                IndentationError,
                "expected an indented block after function definition on line 1",
                (2, 1),
                (2, 5),
            ),
            (
                "def f() -> None:\npass",
                IndentationError,
                "expected an indented block after function definition on line 1",
                (2, 1),
                (2, 5),
            ),
            ("def f:", SyntaxError, "expected '('", (1, 6), (1, 6)),
            ("async def f:", SyntaxError, "expected '('", (1, 12), (1, 12)),
            # (
            #     "def f():\n# type: () -> int\n# type: () -> str\n\tpass",
            #     SyntaxError,
            #     "expected an indented block after function definition on line 1",
            # ),
        ],
    )
    def test_invalid_def_stmt(
        python_parse_file, python_parse_str, tmp_path, source, exception, message, start, end
    ):
>       parse_invalid_syntax(
            python_parse_file,
            python_parse_str,
            tmp_path,
            source,
            exception,
            message,
            start,
            end,
            (3, 11) if exception is SyntaxError else (3, 10),
        )

tests/python_parser/test_syntax_error_handling.py:1228: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

python_parse_file = <function parse_file at 0x7ffff62b7380>
python_parse_str = <function parse_string at 0x7ffff5078e00>
tmp_path = PosixPath('/build/pytest-of-nixbld/pytest-0/test_invalid_def_stmt_async_de1')
source = 'async def f:', exc_cls = <class 'SyntaxError'>
message = "expected '('", start = (1, 12), end = (1, 12)
min_python_version = (3, 11)

    def parse_invalid_syntax(
        python_parse_file,
        python_parse_str,
        tmp_path,
        source,
        exc_cls,
        message,
        start,
        end,
        min_python_version=(3, 10),
    ):
    
        # Check we obtain the expected error from Python
        try:
            exec(source, {}, {})
        except exc_cls as py_e:
            py_exc = py_e
        except Exception as py_e:
            assert (
                False
            ), f"Python produced {py_e.__class__.__name__} instead of {exc_cls.__name__}: {py_e}"
        else:
            assert False, f"Python did not throw any exception, expected {exc_cls}"
    
        # Check our parser raises both from str and file mode.
        with pytest.raises(exc_cls) as e:
            python_parse_str(source, "exec")
    
        print(str(e.exconly()))
        assert message in str(e.exconly())
    
        test_file = tmp_path / "test.py"
        with open(test_file, "w") as f:
            f.write(source)
    
        with pytest.raises(exc_cls) as e:
            python_parse_file(str(test_file))
    
        # Check Python message but do not expect message to match for earlier Python versions
        if sys.version_info >= min_python_version:
            # This fails for Python < 3.10.5 but keeping the fix for a patch version is not
            # worth it
            assert message in py_exc.args[0]
    
        print(str(e.exconly()))
        assert message in str(e.exconly())
    
        # Check start/end line/column on Python 3.10
        for parser, exc in ([("Python", py_exc)] if sys.version_info >= min_python_version else []) + [
            ("pegen", e.value)
        ]:
            if (
                exc.lineno != start[0]
                or exc.offset != start[1]
                # Do not check end for indentation errors
                or (
                    sys.version_info >= (3, 10)
                    and not isinstance(e, IndentationError)
                    and exc.end_lineno != end[0]
                )
                or (
                    sys.version_info >= (3, 10)
                    and not isinstance(e, IndentationError)
                    and (end[1] is not None and exc.end_offset != end[1])
                )
            ):
                if sys.version_info >= (3, 10):
>                   raise ValueError(
                        f"Expected locations of {start} and {end}, but got "
                        f"{(exc.lineno, exc.offset)} and {(exc.end_lineno, exc.end_offset)} "
                        f"from {parser}"
                    )
E                   ValueError: Expected locations of (1, 12) and (1, 12), but got (1, 12) and (1, 13) from Python

tests/python_parser/test_syntax_error_handling.py:74: ValueError
----------------------------- Captured stdout call -----------------------------
  File "<unknown>", line 1
    async def f:
               ^
SyntaxError: expected '('
  File "test.py", line 1
    async def f:
               ^
SyntaxError: expected '('
=========================== short test summary info ============================
FAILED tests/python_parser/test_syntax_error_handling.py::test_invalid_def_stmt[def f:-SyntaxError-expected '('-start4-end4] - ValueError: Expected locations of (1, 6) and (1, 6), but got (1, 6) and (1,...
FAILED tests/python_parser/test_syntax_error_handling.py::test_invalid_def_stmt[async def f:-SyntaxError-expected '('-start5-end5] - ValueError: Expected locations of (1, 12) and (1, 12), but got (1, 12) and ...
======================== 2 failed, 369 passed in 5.74s =========================
/nix/store/37p8gq9zijbw6pj3lpi1ckqiv18j2g62-stdenv-linux/setup: line 1594: pop_var_context: head of shell_variables not a function context
error: builder for '/nix/store/px2vanhxlchdwlan0n6xjn0xvh0d715n-python3.11-pegen-0.2.0.drv' failed with exit code 1;
       last 10 log lines:
       > SyntaxError: expected '('
       >   File "test.py", line 1
       >     async def f:
       >                ^
       > SyntaxError: expected '('
       > =========================== short test summary info ============================
       > FAILED tests/python_parser/test_syntax_error_handling.py::test_invalid_def_stmt[def f:-SyntaxError-expected '('-start4-end4] - ValueError: Expected locations of (1, 6) and (1, 6), but got (1, 6) and (1,...
       > FAILED tests/python_parser/test_syntax_error_handling.py::test_invalid_def_stmt[async def f:-SyntaxError-expected '('-start5-end5] - ValueError: Expected locations of (1, 12) and (1, 12), but got (1, 12) and ...
       > ======================== 2 failed, 369 passed in 5.74s =========================

Expression is not evaluating properly

I am following the code of story1.

An expression like 1 + 2 * 3 evaluates correctly, but an expression like 1 - 2 * 3 - 8 - 5 does not evaluate properly.

The evaluation of 1 - 2 * 3 - 8 - 5 should be -18, not -2.

Here is my code to evaluate the expression:

from io import StringIO
from token import NAME, NUMBER, NEWLINE, ENDMARKER
from tokenize import generate_tokens

from tokenizer import Tokenizer
from parser import Parser
from toy import ToyParser

def test_expr():     
    def _eval(node):
        if node.type == 'add':
            return _eval(node.children[0]) + _eval(node.children[1])
        elif node.type == 'sub':
            return _eval(node.children[0]) - _eval(node.children[1])
        elif node.type == 'mul':
            return _eval(node.children[0]) * _eval(node.children[1])
        elif node.type == 'div':
            return _eval(node.children[0]) / _eval(node.children[1])
        elif node.type == NUMBER:
            return int(node.string)
    def _print(node):
        if node.type == 'add':
            return ("{} + {}".format(
                    _print(node.children[0]), _print(node.children[1])))
        elif node.type == 'sub':
            return ("{} - {}".format(
                    _print(node.children[0]), _print(node.children[1])))
        elif node.type == 'mul':
            return ("{} * {}".format(
                    _print(node.children[0]), _print(node.children[1])))
        elif node.type == 'div':
            return ("{} / {}".format(
                    _print(node.children[0]), _print(node.children[1])))
        elif node.type == NUMBER:
            return str(int(node.string))
    program = "(1 + 2) * 3"
    program = "1 - 2 * 3 - 8 - 5"
    file = StringIO(program)
    tokengen = generate_tokens(file.readline)
    tok = Tokenizer(tokengen)
    p = ToyParser(tok)
    tree = p.statement()
    print(program)
    print(tree)
    print(_eval(tree))
    print(_print(tree))
    print(eval(program))
    
test_expr()

Here is the output

1 - 2 * 3 - 8 - 5
Node(sub, [TokenInfo(type=2 (NUMBER), string='1', start=(1, 0), end=(1, 1), line='1 - 2 * 3 - 8 - 5'), Node(sub, [Node(mul, [TokenInfo(type=2 (NUMBER), string='2', start=(1, 4), end=(1, 5), line='1 - 2 * 3 - 8 - 5'), TokenInfo(type=2 (NUMBER), string='3', start=(1, 8), end=(1, 9), line='1 - 2 * 3 - 8 - 5')]), Node(sub, [TokenInfo(type=2 (NUMBER), string='8', start=(1, 12), end=(1, 13), line='1 - 2 * 3 - 8 - 5'), TokenInfo(type=2 (NUMBER), string='5', start=(1, 16), end=(1, 17), line='1 - 2 * 3 - 8 - 5')])])])
-2
1 - 2 * 3 - 8 - 5
-18

I couldn't figure out whether my _eval() is correct or whether there is a bug in the ToyParser class.

Please help me out. Thank you.

Attachment: story1.zip

Documentation: should C be mentioned

Since one of the goals of the repo is to publish pegen on PyPI while including only the Python generator, should the documentation mention C at all?

I am going through the docs and could make a PR to make the docs refer only to Python if that makes sense.

Support for python 3.7

I have a patch to make pegen generate parsers that work in python 3.7, primarily by not using the walrus operator: https://github.com/ahupp/pegen/tree/python37

The generated code is pretty ugly and 3.7 is EOL in a year so I would not suggest merging it, but if anyone else is unfortunate enough to need python 3.7 support they might find this handy.

Pegen Crashes on f-strings in 3.12

Pegen seems to crash when trying to parse f-strings in Python 3.12. I'm guessing this is because Pegen is invoking the 3.12 tokenizer that now uses several new tokens for f-strings, not just STRING.

To reproduce I installed Pegen in 3.12 and ran python -m pegen data/python.gram, which resulted in the following error:

  File "<unknown>", line 1690
    self.raise_syntax_error_known_range(f"cannot assign to {a.string}", a, b)
                                        ^
SyntaxError: data/python.gram

typo in README.md

In the very last codeblock, at the first line, there is a missing parenthesis:

ast.Module(body=a or []
                       ^

Allow parsing deeply nested structures

In enaml (https://github.com/nucleic/enaml), one user reported a RecursionError when trying to parse a deeply nested function (see below). The only solution I could think of would be for pegen to make use of the trampoline pattern to avoid recursion through the use of generators. However, such a change would be highly invasive and would make the code harder to reason about.

Does anybody have an alternative option to suggest?

def fn_opcua_get_tree(i_obj_tree):
	"""	Get OPCUA nodes
	"""
	global g_dict_cnx	# Allow global access

	if i_obj_tree is None or i_obj_tree.str_type != G_OPCUA_TYPE:
		l_dict_valid = {}

		if g_dict_cnx["client"] is not None:
			# List OPCUA root nodes (namespaces)
			l_list_root = [fn_opcua_get_node(
					g_dict_cnx["client"].nodes.root
				)]

			# Debug
			#pprint.pprint(l_list_root)
			#pprint.pprint(l_list_root.keys())
			#pprint.pprint(l_list_root.items())
			#pprint.pprint(l_list_root['children'].items())
			#print(type(l_list_root))
			#print(type(l_list_root.keys()))
			#print(type(l_list_root.items()))

			#for loop_dict_1 in l_list_root:
				#print(type(loop_dict_1))
				#print(loop_dict_1)

			if True:
				# Create the tree from 'valid' nodes (up to 9 level deep)
				i_obj_tree = \
					EnamlxTree(
						str_type = G_OPCUA_TYPE,
						list_root = [
							EnamlxNode(
								str_text = f"{loop_dict_1['name']}",
								#			 enum_COLUMN.Id,	enum_COLUMN.Cls,	enum_COLUMN.Type
								list_data = [loop_dict_1['id'], loop_dict_1['cls'], loop_dict_1['type']],
								list_node = [
									EnamlxNode(
										str_text = f"{loop_dict_2['name']}",
										list_data = [loop_dict_2['id'], loop_dict_2['cls'], loop_dict_2['type']],
										list_node = [
											EnamlxNode(
												str_text = f"{loop_dict_3['name']}",
												list_data = [loop_dict_3['id'], loop_dict_3['cls'], loop_dict_3['type']],
												list_node = [
													EnamlxNode(
														str_text = f"{loop_dict_4['name']}",
														list_data = [loop_dict_4['id'], loop_dict_4['cls'], loop_dict_4['type']],
														list_node = [
															EnamlxNode(
																str_text = f"{loop_dict_5['name']}",
																list_data = [loop_dict_5['id'], loop_dict_5['cls'], loop_dict_5['type']],
																list_node = [
																	EnamlxNode(
																		str_text = f"{loop_dict_6['name']}",
																		list_data = [loop_dict_6['id'], loop_dict_6['cls'], loop_dict_6['type']],
																		list_node = [
																			EnamlxNode(
																				str_text = f"{loop_dict_7['name']}",
																				list_data = [loop_dict_7['id'], loop_dict_7['cls'], loop_dict_7['type']],
																				list_node = [
																					EnamlxNode(
																						str_text = f"{loop_dict_8['name']}",
																						list_data = [loop_dict_8['id'], loop_dict_8['cls'], loop_dict_8['type']],
																						list_node = [
																							EnamlxNode(
																								str_text = f"{loop_dict_9['name']}",
																								list_data = [loop_dict_9['id'], loop_dict_9['cls'], loop_dict_9['type']],
																								list_node = [
																								]
																							) for loop_dict_9 in loop_dict_8['children']
																						]
																					) for loop_dict_8 in loop_dict_7['children']
																				]
																			) for loop_dict_7 in loop_dict_6['children']
																		]
																	) for loop_dict_6 in loop_dict_5['children']
																]
															) for loop_dict_5 in loop_dict_4['children']
														]
													) for loop_dict_4 in loop_dict_3['children']
												]
											) for loop_dict_3 in loop_dict_2['children']
										]
									) for loop_dict_2 in loop_dict_1['children']
								]
							) for loop_dict_1 in l_list_root
						]
					)
			else:
				# Test tree
				i_obj_tree = \
					EnamlxTree(
						str_type = G_OPCUA_TYPE,
						list_root = [
							EnamlxNode(
								str_text = "TEST",
								list_data = ["", 0, None],
								list_node = []
								)
						])
		else:
			# Empty tree
			i_obj_tree = \
				EnamlxTree(
					str_type = G_OPCUA_TYPE,
					list_root = [
						EnamlxNode(
							str_text = "",
							list_data = ["", "", ""],
							list_node = []
							)
					])

	return i_obj_tree

Publishing on the PyPI

Pegen's Python parser generator is really cool and also very useful, though the version on PyPI seems not to be in sync with the upstream version under Tools/peg_generator. It is possible now to just clone CPython and install it, though that is not very feasible. The other way would be, after generating the parser, to just host pegen/parser.py / pegen/tokenizer.py inside the project (which is what I do now), though that is not a very good solution either. I think it would be really amazing if we hosted this package on PyPI (maybe automating some of it with GitHub Actions: clone CPython, check if anything changed under Tools/peg_generator, sync it here, and open an issue to notify the maintainers that they should make a release). If any help is needed, I can try to assist.

Is formatting the subheader required?

Currently both the header and the subheader are formatted using .format(filename=filename).

I agree it is useful to include the name of the grammar in the header; however, doing the same for the subheader does not seem necessary and makes the use of any kind of formatting operation (for error messages, for example) painful due to the need to use double {} to escape the first formatting.

Would the authors be favorable to removing the formatting of the subheader?
