we-like-parsers / pegen_experiments
Experiments for the official PEG parser generator for Python
Home Page: https://github.com/python/cpython/tree/master/Tools/peg_generator
License: Other
When executing:
python -m pytest -v -k test_same_name_different_types
I get a segmentation fault when loading the generated extension:
============================================================= test session starts =============================================================
platform linux -- Python 3.8.0+, pytest-5.2.2, py-1.8.0, pluggy-0.13.1 -- /home/pablogsal/github/pegen/../python/3.8/python
cachedir: .pytest_cache
rootdir: /home/pablogsal/github/pegen, inifile: pytest.ini
plugins: cov-2.8.1
collected 174 items / 173 deselected / 1 selected
test/test_c_parser.py::test_same_name_different_types Fatal Python error: Segmentation fault
Current thread 0x00007f437be02740 (most recent call first):
File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
File "<frozen importlib._bootstrap_external>", line 1101 in create_module
File "<frozen importlib._bootstrap>", line 556 in module_from_spec
File "./pegen/testutil.py", line 62 in import_file
File "./pegen/testutil.py", line 84 in generate_parser_c_extension
File "/home/pablogsal/github/pegen/test/test_c_parser.py", line 32 in verify_ast_generation
File "/home/pablogsal/github/pegen/test/test_c_parser.py", line 229 in test_same_name_different_types
File "/home/pablogsal/.local/lib/python3.8/site-packages/_pytest/python.py", line 170 in pytest_pyfunc_call
File "/home/pablogsal/.local/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
File "/home/pablogsal/.local/lib/python3.8/site-packages/pluggy/manager.py", line 84 in <lambda>
File "/home/pablogsal/.local/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
File "/home/pablogsal/.local/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
File "/home/pablogsal/.local/lib/python3.8/site-packages/_pytest/python.py", line 1423 in runtest
File "/home/pablogsal/.local/lib/python3.8/site-packages/_pytest/runner.py", line 125 in pytest_runtest_call
File "/home/pablogsal/.local/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
File "/home/pablogsal/.local/lib/python3.8/site-packages/pluggy/manager.py", line 84 in <lambda>
File "/home/pablogsal/.local/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
File "/home/pablogsal/.local/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
File "/home/pablogsal/.local/lib/python3.8/site-packages/_pytest/runner.py", line 201 in <lambda>
File "/home/pablogsal/.local/lib/python3.8/site-packages/_pytest/runner.py", line 229 in from_call
File "/home/pablogsal/.local/lib/python3.8/site-packages/_pytest/runner.py", line 200 in call_runtest_hook
File "/home/pablogsal/.local/lib/python3.8/site-packages/_pytest/runner.py", line 176 in call_and_report
File "/home/pablogsal/.local/lib/python3.8/site-packages/_pytest/runner.py", line 95 in runtestprotocol
File "/home/pablogsal/.local/lib/python3.8/site-packages/_pytest/runner.py", line 80 in pytest_runtest_protocol
File "/home/pablogsal/.local/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
File "/home/pablogsal/.local/lib/python3.8/site-packages/pluggy/manager.py", line 84 in <lambda>
File "/home/pablogsal/.local/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
File "/home/pablogsal/.local/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
File "/home/pablogsal/.local/lib/python3.8/site-packages/_pytest/main.py", line 258 in pytest_runtestloop
File "/home/pablogsal/.local/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
File "/home/pablogsal/.local/lib/python3.8/site-packages/pluggy/manager.py", line 84 in <lambda>
File "/home/pablogsal/.local/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
File "/home/pablogsal/.local/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
File "/home/pablogsal/.local/lib/python3.8/site-packages/_pytest/main.py", line 237 in _main
File "/home/pablogsal/.local/lib/python3.8/site-packages/_pytest/main.py", line 193 in wrap_session
File "/home/pablogsal/.local/lib/python3.8/site-packages/_pytest/main.py", line 230 in pytest_cmdline_main
File "/home/pablogsal/.local/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
File "/home/pablogsal/.local/lib/python3.8/site-packages/pluggy/manager.py", line 84 in <lambda>
File "/home/pablogsal/.local/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
File "/home/pablogsal/.local/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
File "/home/pablogsal/.local/lib/python3.8/site-packages/_pytest/config/__init__.py", line 90 in main
File "/home/pablogsal/.local/lib/python3.8/site-packages/pytest.py", line 101 in <module>
File "/home/pablogsal/github/python/3.8/Lib/runpy.py", line 86 in _run_code
File "/home/pablogsal/github/python/3.8/Lib/runpy.py", line 193 in _run_module_as_main
[1] 19965 segmentation fault (core dumped) ../python/3.8/python -m pytest -v -k test_same_name_different_types
When you try pegen with Python 3.9, you may get an error like this (on Mac; other platforms may vary):
$ make test
python3 -m pegen -q -c data/cprog.gram -o pegen/parse.c --compile-extension
python3 -c "from pegen import parse; t = parse.parse_file('data/cprog.txt'); exec(compile(t, '', 'exec'))"
Traceback (most recent call last):
File "<string>", line 1, in <module>
ImportError: dlopen(/Users/guido/pegen/pegen/parse.cpython-39-darwin.so, 2): Symbol not found: _PyAST_mod2obj
Referenced from: /Users/guido/pegen/pegen/parse.cpython-39-darwin.so
Expected in: flat namespace
in /Users/guido/pegen/pegen/parse.cpython-39-darwin.so
make: *** [test] Error 1
With the help of Victor Stinner I've found that this is due to the addition of -fvisibility=hidden to CONFIGURE_CFLAGS_NODIST in CPython's Makefile. This flag[1] hides many of the functions we're using, from PyAST_mod2obj via PyTokenizer_Free to _Py_BinOp. None of those functions are part of the public C API, and the new flag hides them from extension modules like pegen.
The best thing I've come up with is to just remove that flag from CPython's Makefile, then make clean and make altinstall.
A serious consequence of this is that the C code generator will really only be useful once it's been integrated into CPython. I don't want to spend effort on convincing the CPython core devs that we should make all those APIs public.
Here's a patch to CPython's configure file that will preserve the change when the Makefile is rebuilt (but not when configure.in is rebuilt):
diff --git a/configure b/configure
index 44f14c3c2c..1901b6edc7 100755
--- a/configure
+++ b/configure
@@ -7377,10 +7377,10 @@ fi
{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_enable_visibility" >&5
$as_echo "$ac_cv_enable_visibility" >&6; }
- if test $ac_cv_enable_visibility = yes
- then
- CFLAGS_NODIST="$CFLAGS_NODIST -fvisibility=hidden"
- fi
+ ## if test $ac_cv_enable_visibility = yes
+ ## then
+ ## CFLAGS_NODIST="$CFLAGS_NODIST -fvisibility=hidden"
+ ## fi
# if using gcc on alpha, use -mieee to get (near) full IEEE 754
# support. Without this, treatment of subnormals doesn't follow
[1] See http://gcc.gnu.org/wiki/Visibility for more on visibility. It's a GCC 4.0+ feature.
I'm in the process of writing some really simple docs. What I've done so far is write basic docs for all the external helper functions and all the helper structs, "stealing" some of the material from notes.md.
I've also created a file called grammar.py, but so far it only describes the styling I'm using when refactoring the grammar rules.
Do you have any other ideas on what should be included in these early-stage docs?
I'm currently working on getting BinOps, BoolOps and comparisons to work, but I don't really want to refactor the primary rule just yet, which means that if there is something more than just an atom in the expression, CONSTRUCTOR gets called, whose result gets propagated up to the BoolOps. I'm assuming that's why I'm getting a SystemError: unknown boolop found when running simpy_cpython. Any ideas on how we could circumvent this for now?
For some nodes, the context (ctx) needs to be assigned correctly (and in some cases, this may involve a recursive traversal). Currently, we only have a helper expr_ty store_name(Parser *p, expr_ty load_name) to assign Store context to names, but we need a more general solution.
The following cases for an expr_ty need to be handled:
Attribute_kind
Subscript_kind
Starred_kind
Name_kind
List_kind
Tuple_kind
The relevant code in CPython lives in:
https://github.com/python/cpython/blob/master/Python/ast.c#L1124
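Below is a minimal sketch of what such a general helper could look like, loosely modeled on set_context() in CPython's Python/ast.c. The function name set_expr_context and its error-handling convention are assumptions, not existing pegen API.
// Hypothetical sketch (requires Python.h and Python-ast.h); returns 1 on
// success and 0 for nodes that cannot be an assignment target.
static int
set_expr_context(expr_ty e, expr_context_ty ctx)
{
    asdl_seq *elts = NULL;

    switch (e->kind) {
    case Name_kind:
        e->v.Name.ctx = ctx;
        break;
    case Attribute_kind:
        e->v.Attribute.ctx = ctx;
        break;
    case Subscript_kind:
        e->v.Subscript.ctx = ctx;
        break;
    case Starred_kind:
        e->v.Starred.ctx = ctx;
        if (!set_expr_context(e->v.Starred.value, ctx))
            return 0;
        break;
    case List_kind:
        e->v.List.ctx = ctx;
        elts = e->v.List.elts;
        break;
    case Tuple_kind:
        e->v.Tuple.ctx = ctx;
        elts = e->v.Tuple.elts;
        break;
    default:
        return 0;  /* not a valid assignment target */
    }
    /* Recurse into the elements of List and Tuple targets. */
    if (elts != NULL) {
        for (Py_ssize_t i = 0; i < asdl_seq_LEN(elts); i++) {
            if (!set_expr_context((expr_ty)asdl_seq_GET(elts, i), ctx))
                return 0;
        }
    }
    return 1;
}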
In order for SyntaxError to be formatted with a nice display of the line, like this,
File "<unknown>", line 1
1/
^
SyntaxError: invalid syntax
we must set the text attribute to the line containing the error.
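For reference, here is a hedged sketch of how the error could be raised from C using the standard SyntaxError detail layout (msg, (filename, lineno, offset, text)), which is what the traceback machinery uses to print the line and the caret. The helper name is hypothetical.
/* Hypothetical helper; the (msg, (filename, lineno, offset, text)) layout is
   the documented SyntaxError detail tuple. */
static void
raise_syntax_error(const char *msg, const char *filename,
                   int lineno, int col_offset, const char *line_text)
{
    PyObject *args = Py_BuildValue("s(siis)", msg, filename,
                                   lineno, col_offset, line_text);
    if (args == NULL)
        return;  /* an error (e.g. MemoryError) is already set */
    PyErr_SetObject(PyExc_SyntaxError, args);
    Py_DECREF(args);
}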
Up until now I was under the impression that the expressions rule should return either a tuple, if there are multiple expressions, or the expression itself, if there is only one.
But trying to refactor the various atom rules, I saw this:
set: '{' expressions '}'
which means that expressions should return an asdl_seq * of all the parsed expressions?
What is the right approach here?
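One way to reconcile the two uses, sketched below purely as an assumption, is to have expressions always return an asdl_seq * and collapse the sequence where a single expression (or a Tuple node) is needed. The helper name seq_to_expr is hypothetical.
/* Hypothetical helper: a one-element sequence yields the element itself,
   anything longer becomes a Tuple node (3.8+ constructor signature). */
static expr_ty
seq_to_expr(Parser *p, asdl_seq *exprs, int lineno, int col_offset,
            int end_lineno, int end_col_offset)
{
    if (asdl_seq_LEN(exprs) == 1)
        return (expr_ty)asdl_seq_GET(exprs, 0);
    return Tuple(exprs, Load, lineno, col_offset,
                 end_lineno, end_col_offset, p->arena);
}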
The model and the code generators make heavy use of Rhs, but RHS in general means just "right hand side", with no hint about the type. In the model, Rhs is intended to be the right hand side of rules, but the node type of that, after simple optimizations, could be any of the node types [Cut, Named, StringLeaf, Lookahead, PositiveLookahead, NegativeLookahead, Choice, Alt, Seq, Opt, Repeat0, Repeat1, Group].
It's best to drop Rhs and stick to the node types that describe types of expressions. That way the visitors, especially the code generators, can have specific actions for each type.
This goes hand in hand with removing synthetic identifiers for expressions in rules, because they complicate the code generators to benefit only toy parsers. Most subexpressions on the right hand side of a rule are discardable, and name=exp is enough to explicitly recover the elements relevant to the translation.
Now that #47 is closed and actions for simpy.gram are ready, I think it'd be helpful to gather thoughts on what's coming next and maybe open issues etc. Here are some things I think are essential:
simpy_cpython passing with TESTFLAGS=-stt.
pgen.
Have I missed something important?
I just discovered that make test prints the output as
"Hello ""world"
Using bisection I found that it (correctly) printed
Hello world
before 3be531c (#133). That PR refactored string parsing thoroughly. I presume something went wrong. @pablogsal Can you look into this?
There's a complete Python grammar in data/simpy.gram, but it is lacking actions.
OTOH there are actions (written in C) for a very small subset in data/cprog.gram -- these are tested by the Makefile targets test, compile, dump and time. The actions construct accurate AST nodes.
We should add actions to simpy.gram modeled after those in cprog.gram.
This will likely require refactoring the grammar in simpy.gram.
As noted in #67, the construct X (sep X)* is heavily used by the current grammar, at least 16 times according to a quick search in my editor. It would be a good idea to add a custom rule to handle this construct, so that we don't have to use seq_insert_in_front every time, as it creates a brand new asdl_seq*. What would the best design be for such a rule?
I got to implementing the actions for while and for and found out that, upon parsing a NAME, name_token gets called, which always uses Load. Except for the quick workaround
target: a=NAME { _Py_Name(a->v.Name.id, Store, EXTRA_EXPR(a, a)) }
there is unfortunately no other way to use Store for a NAME.
What do you think is the best way to implement this?
I have been thinking a bit about error handling in rule actions and action helpers, and I propose either allowing the c_generator to automatically create functions from whatever you include between the { } (harder), or forcing ourselves to only call functions defined in the pegen.c file or another compilation unit (easier). That way, we can do proper error handling and teach the C generator to propagate errors. This becomes more and more important as we implement more rules, because failures in the helpers usually only show up much later, as corrupted objects in deallocation procedures or similar.
What do you think?
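To illustrate the "helpers as real functions" option, here is a hedged sketch; the name make_string_constant is an assumption, not existing pegen code. The helper does the fallible C-API work and returns NULL on failure with the Python error already set, so the generated rule code only has to check the result and bail out.
/* Hypothetical helper living in pegen.c (or another compilation unit). */
static expr_ty
make_string_constant(Parser *p, const char *s, int lineno, int col_offset,
                     int end_lineno, int end_col_offset)
{
    PyObject *str = PyUnicode_FromString(s);
    if (str == NULL)
        return NULL;                      /* error already set, propagate */
    if (PyArena_AddPyObject(p->arena, str) < 0) {
        Py_DECREF(str);
        return NULL;
    }
    /* Constant() is the 3.8+ constructor with a `kind` slot. */
    return Constant(str, NULL, lineno, col_offset,
                    end_lineno, end_col_offset, p->arena);
}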
There are a bunch of places where we use (probably) too many void* types, and this stops gcc from reporting multiple errors due to passing incorrect AST types from one side to the other. Many of the errors that could actually be caught are emitted as warnings.
In some cases, we know the return type of the function, so we know that anything assigned to the resulting value (in many cases named res) must have the same type.
Another possibility, for the time being, is to transform some of the warnings into errors via compiler flags.
The reason I am raising this issue is that debugging these cases almost always means dealing with segmentation faults that originate from CPython when it tries to make use of some corrupted nodes.
I am currently trying to refactor simpy.gram to correctly generate the AST for import statements, which once again proved to be a bit more difficult than expected. I have two main questions:
First, there is the rule
dotted_as_name: dotted_name ['as' NAME]
which, the way I see it, could have two possible return types, either expr_ty or alias_ty, depending on whether an alias is defined or not. Is this correct, or do we return alias_ty in both cases? If so, how do we handle two possible return types? Maybe define void * as the return type?
The second question has to do with the rule:
dotted_name: NAME ('.' NAME)*
Do we need to write something like ast_for_dotted_name to handle that, or is there an easier way that I don't see?
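For the first question, note that the AST's alias node has an optional asname, so returning alias_ty in both cases (with asname left NULL when there is no 'as' clause) is at least possible. For the second question, one option, sketched below purely as an assumption (not existing pegen code), is an ast_for_dotted_name-style helper that joins the ids of the parsed Name nodes with "." into a single identifier owned by the arena; this assumes dotted_name returns an asdl_seq * of Name expr_ty nodes.
/* Hypothetical helper: join the ids of a sequence of Name nodes with ".". */
static PyObject *
join_dotted_name(Parser *p, asdl_seq *names)
{
    PyObject *result = NULL;
    PyObject *dot = PyUnicode_FromString(".");
    PyObject *parts = PyList_New(0);
    if (dot == NULL || parts == NULL)
        goto done;

    for (Py_ssize_t i = 0; i < asdl_seq_LEN(names); i++) {
        expr_ty name = (expr_ty)asdl_seq_GET(names, i);
        if (PyList_Append(parts, name->v.Name.id) < 0)
            goto done;
    }
    result = PyUnicode_Join(dot, parts);
    if (result != NULL && PyArena_AddPyObject(p->arena, result) < 0) {
        Py_DECREF(result);
        result = NULL;
    }
done:
    Py_XDECREF(dot);
    Py_XDECREF(parts);
    return result;
}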
According to the latest run, it takes 80 seconds to get a clean checkout of the cpython repo. That holds up the build by about that much: the total build time was 137 seconds, the next longest was 57 seconds.
Maybe we should just check in a tarball of cpython/Lib and untar that? It should be much quicker.
Items in an alternative may be named, and the names may be referenced in actions. But there are some "forbidden" names. E.g. don't name an item p
, because (when generating C code) every rule parsing function has an argument named p
. There are other possible name clashes too: res
and mark
are always local variables. And there are many helper functions, with names like is_memoized
or singleton_seq
. And of course anything that's a C reserved word (e.g. if
) cannot be used either. Also there are systematic generated names like *_var
, *_type
and *_rule
.
It's easy to rename p
, mark
and res
in the generated code to start with an underscore (by convention, rule names don't start with _
, and maybe we should make this a hard requirement). I'm not sure we need to worry about the others, though we may have to warn about them in the docs.
When generating Python code there are other possible clashes, e.g. self
, mark
and cut
. We can handle these the same way. (There are a few others that seem less important, like ast
and of course Python builtins and keywords.)
(From @apalala)
The characteristics of performance problems in a grammar are too many failed options/rules, and failed options with tracebacks that are very deep.
Applying the "cut" expression should take care of most problems, because we know Python remains LL(small-K).
For later, option reordering is a common way to optimize. Options that produce the most "hits" should come first in rules and groups.
The above is a combination of failing early (cut), and succeeding early.
Hello,
I am currently trying to add actions for the pass statement in simpy.gram. That proved a little more difficult than I expected, due to the following problem:
A simple statement can contain more than one statement, thus its return type should be asdl_seq *, which is propagated to the statement rule, whose return type also becomes asdl_seq *.
My question is, how can the statements rule handle such a return type from statement? Going through the generated code, it seems like the loop for statements expects stmt_ty as the return type of statement.
The relevant grammar rules are these:
start[mod_ty]: a=[statements] ENDMARKER { Module(a, NULL, p->arena) }
statements[asdl_seq*]: a=statement+ { a }
statement[asdl_seq*]: compound_stmt | simple_stmt
simple_stmt[asdl_seq*]: a=small_stmt b=further_small_stmt* [';'] NEWLINE { seq_insert_in_front(p, b, a) }
further_small_stmt[stmt_ty]: ';' a=small_stmt { a }
small_stmt[stmt_ty]: ( return_stmt | import_stmt | pass_stmt | raise_stmt | yield_stmt | assert_stmt | del_stmt | global_stmt | nonlocal_stmt
| assignment | expressions
)
pass_stmt[stmt_ty]: a='pass' { _Py_Pass(EXTRA(a, a)) }
Surely, the action for statements is wrong and should be something else, but I can't really figure out what.
Side note: seq_insert_in_front is a function I wrote in pegen.c, which accepts the asdl_seq * generated by the loop in simple_stmt and prepends the first small_stmt.
I'm currently trying to get all the *_stmt rules to work and I have stumbled upon a very weird bug with the global and nonlocal statements.
When I try to parse a very simple statement like global a, b, it succeeds, but the very next statement, whatever it is, fails with a segmentation fault. Even just pressing the Tab key to autocomplete throws a SEGFAULT.
It seems that something is accessed afterwards that was located in the arena block created (and later freed) in run_parser. Could this be a bug in CPython?
==18593== Invalid read of size 8
==18593== at 0x1B3FC3: _PyObject_IsFreed (object.c:448)
==18593== by 0x294C3B: visit_decref (gcmodule.c:379)
==18593== by 0x18C72E: list_traverse (listobject.c:2632)
==18593== by 0x294008: subtract_refs (gcmodule.c:406)
==18593== by 0x295556: collect (gcmodule.c:1054)
==18593== by 0x295EBF: collect_with_callback (gcmodule.c:1240)
==18593== by 0x296173: collect_generations (gcmodule.c:1262)
==18593== by 0x296250: _PyObject_GC_Alloc (gcmodule.c:1977)
==18593== by 0x296C0E: _PyObject_GC_Malloc (gcmodule.c:1987)
==18593== by 0x296CE9: _PyObject_GC_NewVar (gcmodule.c:2016)
==18593== by 0x1C69F6: PyTuple_New (tupleobject.c:118)
==18593== by 0x1DA0B5: mro_implementation (typeobject.c:1751)
==18593== by 0x1DAE61: type_mro_impl (typeobject.c:1818)
==18593== by 0x1DAED3: type_mro (typeobject.c.h:76)
==18593== by 0x32B3B4: method_vectorcall_NOARGS (descrobject.c:393)
==18593== by 0x1CBEC4: _PyObject_Vectorcall (abstract.h:127)
==18593== by 0x1CBEC4: _PyObject_FastCall (abstract.h:147)
==18593== by 0x1CBEC4: call_unbound_noarg (typeobject.c:1460)
==18593== by 0x1DA341: mro_invoke (typeobject.c:1888)
==18593== by 0x1DA489: mro_internal (typeobject.c:1943)
==18593== by 0x1D1EED: PyType_Ready (typeobject.c:5337)
==18593== by 0x1D9D50: type_new (typeobject.c:2811)
==18593== by 0x1D0D4D: type_call (typeobject.c:969)
==18593== by 0x1782CA: _PyObject_MakeTpCall (call.c:159)
==18593== by 0x17A0C6: _PyObject_FastCallDict (call.c:91)
==18593== by 0x36687F: builtin___build_class__ (bltinmodule.c:231)
==18593== by 0x1B1511: cfunction_vectorcall_FASTCALL_KEYWORDS (methodobject.c:436)
==18593== Address 0x538dc08 is 680 bytes inside a block of size 8,248 free'd
==18593== at 0x483BA3F: free (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==18593== by 0x1B8290: _PyMem_RawFree (obmalloc.c:127)
==18593== by 0x1B8978: _PyMem_DebugRawFree (obmalloc.c:2201)
==18593== by 0x1B89A0: _PyMem_DebugFree (obmalloc.c:2331)
==18593== by 0x1B982A: PyMem_Free (obmalloc.c:629)
==18593== by 0x26B9B1: block_free (pyarena.c:95)
==18593== by 0x26BB39: PyArena_Free (pyarena.c:169)
==18593== by 0x54B5BF4: run_parser (pegen.c:388)
==18593== by 0x54B5D34: run_parser_from_string (pegen.c:436)
==18593== by 0x54B670D: parse_string (parse.c:9510)
==18593== by 0x1776C6: cfunction_call_varargs (call.c:757)
==18593== by 0x17A83B: PyCFunction_Call (call.c:772)
==18593== by 0x1782CA: _PyObject_MakeTpCall (call.c:159)
==18593== by 0x236DF9: _PyObject_Vectorcall (abstract.h:125)
==18593== by 0x236DF9: call_function (ceval.c:4987)
==18593== by 0x236DF9: _PyEval_EvalFrameDefault (ceval.c:3469)
==18593== by 0x22A84F: PyEval_EvalFrameEx (ceval.c:741)
==18593== by 0x22B3B9: _PyEval_EvalCodeWithName (ceval.c:4298)
==18593== by 0x22B53A: PyEval_EvalCodeEx (ceval.c:4327)
==18593== by 0x22B56C: PyEval_EvalCode (ceval.c:718)
==18593== by 0x27330C: run_eval_code_obj (pythonrun.c:1117)
==18593== by 0x273731: run_mod (pythonrun.c:1139)
==18593== by 0x27628E: PyRun_InteractiveOneObjectEx (pythonrun.c:259)
==18593== by 0x27664C: PyRun_InteractiveLoopFlags (pythonrun.c:121)
==18593== by 0x276D22: PyRun_AnyFileExFlags (pythonrun.c:80)
==18593== by 0x167564: pymain_run_stdin (main.c:479)
==18593== by 0x168171: pymain_run_python (main.c:568)
==18593== Block was alloc'd at
==18593== at 0x483A7F3: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==18593== by 0x1B8318: _PyMem_RawMalloc (obmalloc.c:99)
==18593== by 0x1B88CF: _PyMem_DebugRawAlloc (obmalloc.c:2134)
==18593== by 0x1B890B: _PyMem_DebugRawMalloc (obmalloc.c:2167)
==18593== by 0x1B8930: _PyMem_DebugMalloc (obmalloc.c:2316)
==18593== by 0x1B979A: PyMem_Malloc (obmalloc.c:605)
==18593== by 0x26B979: block_new (pyarena.c:80)
==18593== by 0x26BAA2: PyArena_New (pyarena.c:134)
==18593== by 0x54B5A2A: run_parser (pegen.c:343)
==18593== by 0x54B5D34: run_parser_from_string (pegen.c:436)
==18593== by 0x54B670D: parse_string (parse.c:9510)
==18593== by 0x1776C6: cfunction_call_varargs (call.c:757)
==18593== by 0x17A83B: PyCFunction_Call (call.c:772)
==18593== by 0x1782CA: _PyObject_MakeTpCall (call.c:159)
==18593== by 0x236DF9: _PyObject_Vectorcall (abstract.h:125)
==18593== by 0x236DF9: call_function (ceval.c:4987)
==18593== by 0x236DF9: _PyEval_EvalFrameDefault (ceval.c:3469)
==18593== by 0x22A84F: PyEval_EvalFrameEx (ceval.c:741)
==18593== by 0x22B3B9: _PyEval_EvalCodeWithName (ceval.c:4298)
==18593== by 0x22B53A: PyEval_EvalCodeEx (ceval.c:4327)
==18593== by 0x22B56C: PyEval_EvalCode (ceval.c:718)
==18593== by 0x27330C: run_eval_code_obj (pythonrun.c:1117)
==18593== by 0x273731: run_mod (pythonrun.c:1139)
==18593== by 0x27628E: PyRun_InteractiveOneObjectEx (pythonrun.c:259)
==18593== by 0x27664C: PyRun_InteractiveLoopFlags (pythonrun.c:121)
==18593== by 0x276D22: PyRun_AnyFileExFlags (pythonrun.c:80)
==18593== by 0x167564: pymain_run_stdin (main.c:479)
==18593==
Modules/gcmodule.c:379: visit_decref: Assertion "!_PyObject_IsFreed(op)" failed
A discussion about comparing ASTs on SO:
https://stackoverflow.com/a/59144406/545637
This was left out of #56.
There is a suspicion that the generated code (or the support code) for mutually left-recursive rules is broken in C. It would be good to at least test this using the same tests as used to test this for the Python generator.
Those tests are at https://github.com/gvanrossum/pegen/blob/master/test/test_pegen.py#L376-L444.
The 'cut' operator serves two purposes in PEG (and other types of grammars): it avoids useless backtracking once the parser has committed to an alternative, and it lets the parser fail fast on invalid input.
The above is important because a parser must deal with valid programs, and also with invalid ones. The cut operator is the best way to make the parser commit to failure fast (an intentionally ambiguous pseudo-Python source could make the parser backtrack a lot).
For future reference.
As proposed by @gvanrossum in #174.
I just found out that make simpy_cpython fails on my Ubuntu machine, because the Python version I'm using was built from source --with-pydebug, and this assertion fails in Objects/typeobject.c.
I think that the file and line causing the failure is Lib/multiprocessing/pool.py, line 517.
It seems that this is somehow related to a = [*b()] currently failing to parse with a ValueError: field value is required for Starred, which I cannot figure out the reason for. Any ideas?
It seems that due to https://github.com/gvanrossum/pegen/blob/c97dc21f674fa3e7a5fd78bd1df9708fde65fc5c/pegen/c_generator.py#L336 updating the same dict for all alternatives, we cannot have the same variable name with different types in different alternatives of a rule. Is that intended?
I discovered it while trying to refactor the rules for import_from like so:
import_from[stmt_ty]:
| a='from' b=dot_or_ellipsis* !'import' c=dotted_name 'import' d=import_as_names_from { _Py_ImportFrom(get_identifier_from_expr_ty(c), d, seq_count_dots(b), EXTRA(a, d)) }
| a='from' b=dot_or_ellipsis+ 'import' c=import_as_names_from { _Py_ImportFrom(NULL, c, seq_count_dots(b), EXTRA(a, c)) }
It seems reasonable to have the name c for different types of rules in both alts.
When refactoring PEG grammars there may be rewrites that ease code generation but hurt the performance of the parser, because PEG is more algorithmic than denotational or functional.
The original simpy grammar was as performant as pgen, and there are no theoretical reasons why pegen can't remain as performant.
To measure this, it's important to have a way to only parse, hence the request for options that make it easy to bypass code generation in unit tests and system tests. After finding rules that take longer than expected, reordering of clauses and/or lookaheads often returns the parser to its previous performance.
(cc @gvanrossum, as discussed by email)
It would be great if there was a tool that produced documentation-ready grammars from the AST-generating grammar.
Also important is that, as new people come into the project to help with the grammar refactoring towards AST generation, there is one and only one way to correctly format the grammar (as with Python code, and as in CPython C code).
A pretty-printer can solve the above, and would take a short time to write, because there's an OO representation and walkers/visitors for all grammars.
(cc @gvanrossum as discussed by email)
In case the grammar gets published to the official Python docs, it should be as easy as possible for people to understand it.
This issue should act as a reminder to do the necessary refactoring in that direction, when/if the grammar gets uploaded.
It would be good if pegen switched from using fragments of the sketch of the PEG Python grammar in the TatSu examples (which came from an untested ANTLR grammar) to fragments from the one in PyGL (which I've been testing, and which came from Grammar/Grammar, which is authoritative).
@apalala (Since I'm quoting you here. :-)
While working on #100 I realized that with our current C parser, the type of a rule cannot be a simple type that cannot be aliased to a pointer. For example, operator_ty is a simple integer, and the code generated by the C generator is invalid, as integers cannot be assigned NULL and the memoization functions also require void*.
I open this issue to discuss how we should proceed in these cases.
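One possible direction, sketched below purely as an assumption (the helper name is not existing pegen API), is to box the enum value in arena-allocated memory, so that such rules keep returning a pointer and NULL still means "no match".
/* Hypothetical helper: box an operator_ty in the arena so the rule's result
   is a pointer that can be NULL-checked and memoized like any other. */
static operator_ty *
box_operator(Parser *p, operator_ty op)
{
    operator_ty *boxed = PyArena_Malloc(p->arena, sizeof(operator_ty));
    if (boxed == NULL)
        return NULL;
    *boxed = op;
    return boxed;
}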
Suppose we have this rule:
rule: foo [bar [baz]] { Thing(bar, baz) }
This currently doesn't work because the scope of the bar and baz variables is just the smallest enclosing group. We can handle a simpler case just fine:
rule: foo b=[bar] { Thing(b, NULL) }
because the optional rule will return NULL if it isn't present. But extending this pattern to the first example would require returning a tuple, like this:
rule: foo t=[b1=bar b2=[baz] {tuple(b1, b2)}] { t ? Thing(t.b1, t.b2) : Thing(NULL, NULL) }
If we changed the scoping rules so that we could extract values directly from nested groups (as long as they physically appear in the alternative) we could write this directly as
rule: foo [b1=bar [b2=baz]] { Thing(b1, b2) }
Instead of TODO.md we can use GitHub issues.
notes.md is just historical stuff, no longer relevant (and it'll always be in the git history).
Due to tok->filename being NULL, Python/compile.c:336 segfaults when parse_string(..., mode=2) is called. What's the best course of action here?
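A minimal sketch of one possible band-aid, assuming the tokenizer's filename field is a PyObject * as in CPython's Parser/tokenizer.h: give string input a dummy filename before the AST is handed to the compiler. The helper name is hypothetical.
/* Hypothetical helper: make sure tok->filename is set for string input. */
static int
ensure_tok_filename(struct tok_state *tok)
{
    if (tok->filename == NULL) {
        tok->filename = PyUnicode_FromString("<string>");
        if (tok->filename == NULL)
            return -1;
    }
    return 0;
}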
When writing the action for the group rule, I found out that parsing something like (yield a) returns the following AST object:
Module(
body=[
Expr(
value=Yield(
value=Name(
id="a", ctx=Load(), lineno=1, col_offset=7, end_lineno=1, end_col_offset=8
),
lineno=1,
col_offset=1,
end_lineno=1,
end_col_offset=8,
),
lineno=1,
col_offset=0,
end_lineno=1,
end_col_offset=9,
)
],
type_ignores=[],
)
That means that the Expr node's start column offset is 0, although the yield node's is 1. The correct action for group is thus:
group[expr_ty]: '(' a=(yield_expr | full_expression) ')' { a }
but the surrounding Expr node needs to have a different column offset and I can't seem to come up with a way of handling this, so any ideas are welcome!
In the atom rule, the STRING+ alternative still needs to be refactored.
Currently, we don't have a way to correctly propagate errors from grammar action helpers, as these must be written as simple expressions. Solving this is important because these helpers perform many C-API calls that can potentially fail (such as memory allocation, Unicode manipulation, etc.), and without explicitly checking for errors these failures will pass unnoticed.
A parser’s behavior is not fully tested until it is fed with enough bad input. With PEG in particular, some types of incorrect input may make the parser backtrack a lot (strategically placed 'cut' operators help avoid that).
It may be possible to create such a test set automatically, by corrupting some valid programs.
Going through all the things I did for #67 and #71, I realised that there might be a slight problem we didn't catch, but I am not sure. Let me explain with the example of the import_name rule:
import_name[stmt_ty]: a='import' b=dotted_as_names { _Py_Import(b, EXTRA(a, b)) }
b is an asdl_seq * and we use it as the tail parameter to EXTRA, which calls ENDLINE and ENDCOL for b, which then access the end_lineno and end_col_offset attributes after casting the asdl_seq * to an expr_ty. I don't really see how that doesn't generate a SEGFAULT, but (SEGFAULTs aside) I think it will probably be a bug in the future, when these AST nodes are used for error reporting.
Am I maybe missing something?
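For sequences whose elements are expr_ty nodes, a small helper like the hedged sketch below could at least make the intent explicit and let the end-position macros read from a real node instead of relying on the cast (note that alias_ty nodes have no location fields at all, so the import_name case would need its end position from somewhere else). The helper name is an assumption.
/* Hypothetical helper: return the last element of a sequence of expr_ty
   nodes, so the end-position macros operate on a real node. */
static expr_ty
seq_last_expr(asdl_seq *seq)
{
    Py_ssize_t n = asdl_seq_LEN(seq);
    if (n == 0)
        return NULL;
    return (expr_ty)asdl_seq_GET(seq, n - 1);
}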
We should add automated tests (and run them on Travis-CI, see .travis.yml) that verify that data/simpy.gram can parse all of the CPython standard library code.
If there is a CPython checkout at $CPYTHON, then this test would be equivalent to running
make simpy TESTDIR=$CPYTHON/Lib
If getting a CPython checkout in Travis is too slow, we could create a gzipped tarball of just $CPYTHON/Lib and check that into the pegen repo, and as part of the test untar it and then run something like the above make command.
In the old parser, reserved words (e.g. if) are always reserved. We should mimic this behavior for most keywords, otherwise people will start abusing the feature. But there should be a way to turn it off for specific situations, to support e.g. async def and await (and, in the future, other "compound keywords" like match of, if we want to go there).
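A minimal sketch of what enforcing this at the parser level might look like, assuming NAME tokens flow through a single helper where the check can be made; the table, names and the check are illustrative only, and "soft" keywords like async/await would need a per-context switch rather than a global table.
/* Hypothetical reserved-word check (requires <string.h>). */
static const char *reserved_words[] = {
    "if", "else", "for", "while", "def", "class", "return", "pass", NULL,
};

static int
is_reserved_word(const char *s)
{
    for (const char **kw = reserved_words; *kw != NULL; kw++) {
        if (strcmp(s, *kw) == 0)
            return 1;
    }
    return 0;
}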
Unfortunately it appears that some needed functions are not exported by CPython, thus are unavailable on Windows.
pegen.obj : error LNK2001: unresolved external symbol _Py_asdl_seq_new
pegen.obj : error LNK2001: unresolved external symbol PyAST_mod2obj
pegen.obj : error LNK2001: unresolved external symbol PyTokenizer_Free
pegen.obj : error LNK2001: unresolved external symbol PyTokenizer_FromString
pegen.obj : error LNK2001: unresolved external symbol PyTokenizer_Get
pegen.obj : error LNK2001: unresolved external symbol _Py_Name
pegen.obj : error LNK2001: unresolved external symbol PyTokenizer_FromFile
pegen.obj : error LNK2001: unresolved external symbol _Py_Constant
parse.obj : error LNK2001: unresolved external symbol _Py_BinOp
parse.obj : error LNK2001: unresolved external symbol _Py_Module
parse.obj : error LNK2001: unresolved external symbol _Py_Call
parse.obj : error LNK2001: unresolved external symbol _Py_Expr
parse.obj : error LNK2001: unresolved external symbol _Py_If
parse.obj : error LNK2001: unresolved external symbol _Py_Pass
This isn't really a request for support, just mentioning this. Ideally pegen gets upstreamed soon :P