we-like-parsers / pegen_experiments
Experiments for the official PEG parser generator for Python
Home Page: https://github.com/python/cpython/tree/master/Tools/peg_generator
License: Other
When executing:
python -m pytest -v -k test_same_name_different_types
I get a segmentation fault when loading the generated extension:
============================================================= test session starts =============================================================
platform linux -- Python 3.8.0+, pytest-5.2.2, py-1.8.0, pluggy-0.13.1 -- /home/pablogsal/github/pegen/../python/3.8/python
cachedir: .pytest_cache
rootdir: /home/pablogsal/github/pegen, inifile: pytest.ini
plugins: cov-2.8.1
collected 174 items / 173 deselected / 1 selected
test/test_c_parser.py::test_same_name_different_types Fatal Python error: Segmentation fault
Current thread 0x00007f437be02740 (most recent call first):
File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
File "<frozen importlib._bootstrap_external>", line 1101 in create_module
File "<frozen importlib._bootstrap>", line 556 in module_from_spec
File "./pegen/testutil.py", line 62 in import_file
File "./pegen/testutil.py", line 84 in generate_parser_c_extension
File "/home/pablogsal/github/pegen/test/test_c_parser.py", line 32 in verify_ast_generation
File "/home/pablogsal/github/pegen/test/test_c_parser.py", line 229 in test_same_name_different_types
File "/home/pablogsal/.local/lib/python3.8/site-packages/_pytest/python.py", line 170 in pytest_pyfunc_call
File "/home/pablogsal/.local/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
File "/home/pablogsal/.local/lib/python3.8/site-packages/pluggy/manager.py", line 84 in <lambda>
File "/home/pablogsal/.local/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
File "/home/pablogsal/.local/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
File "/home/pablogsal/.local/lib/python3.8/site-packages/_pytest/python.py", line 1423 in runtest
File "/home/pablogsal/.local/lib/python3.8/site-packages/_pytest/runner.py", line 125 in pytest_runtest_call
File "/home/pablogsal/.local/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
File "/home/pablogsal/.local/lib/python3.8/site-packages/pluggy/manager.py", line 84 in <lambda>
File "/home/pablogsal/.local/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
File "/home/pablogsal/.local/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
File "/home/pablogsal/.local/lib/python3.8/site-packages/_pytest/runner.py", line 201 in <lambda>
File "/home/pablogsal/.local/lib/python3.8/site-packages/_pytest/runner.py", line 229 in from_call
File "/home/pablogsal/.local/lib/python3.8/site-packages/_pytest/runner.py", line 200 in call_runtest_hook
File "/home/pablogsal/.local/lib/python3.8/site-packages/_pytest/runner.py", line 176 in call_and_report
File "/home/pablogsal/.local/lib/python3.8/site-packages/_pytest/runner.py", line 95 in runtestprotocol
File "/home/pablogsal/.local/lib/python3.8/site-packages/_pytest/runner.py", line 80 in pytest_runtest_protocol
File "/home/pablogsal/.local/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
File "/home/pablogsal/.local/lib/python3.8/site-packages/pluggy/manager.py", line 84 in <lambda>
File "/home/pablogsal/.local/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
File "/home/pablogsal/.local/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
File "/home/pablogsal/.local/lib/python3.8/site-packages/_pytest/main.py", line 258 in pytest_runtestloop
File "/home/pablogsal/.local/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
File "/home/pablogsal/.local/lib/python3.8/site-packages/pluggy/manager.py", line 84 in <lambda>
File "/home/pablogsal/.local/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
File "/home/pablogsal/.local/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
File "/home/pablogsal/.local/lib/python3.8/site-packages/_pytest/main.py", line 237 in _main
File "/home/pablogsal/.local/lib/python3.8/site-packages/_pytest/main.py", line 193 in wrap_session
File "/home/pablogsal/.local/lib/python3.8/site-packages/_pytest/main.py", line 230 in pytest_cmdline_main
File "/home/pablogsal/.local/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
File "/home/pablogsal/.local/lib/python3.8/site-packages/pluggy/manager.py", line 84 in <lambda>
File "/home/pablogsal/.local/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
File "/home/pablogsal/.local/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
File "/home/pablogsal/.local/lib/python3.8/site-packages/_pytest/config/__init__.py", line 90 in main
File "/home/pablogsal/.local/lib/python3.8/site-packages/pytest.py", line 101 in <module>
File "/home/pablogsal/github/python/3.8/Lib/runpy.py", line 86 in _run_code
File "/home/pablogsal/github/python/3.8/Lib/runpy.py", line 193 in _run_module_as_main
[1] 19965 segmentation fault (core dumped) ../python/3.8/python -m pytest -v -k test_same_name_different_types
When you try pegen with Python 3.9, you may get an error like this (on Mac; other platforms may vary):
$ make test
python3 -m pegen -q -c data/cprog.gram -o pegen/parse.c --compile-extension
python3 -c "from pegen import parse; t = parse.parse_file('data/cprog.txt'); exec(compile(t, '', 'exec'))"
Traceback (most recent call last):
File "<string>", line 1, in <module>
ImportError: dlopen(/Users/guido/pegen/pegen/parse.cpython-39-darwin.so, 2): Symbol not found: _PyAST_mod2obj
Referenced from: /Users/guido/pegen/pegen/parse.cpython-39-darwin.so
Expected in: flat namespace
in /Users/guido/pegen/pegen/parse.cpython-39-darwin.so
make: *** [test] Error 1
With the help of Victor Stinner I've found that this is due to the addition of -fvisibility=hidden to CONFIGURE_CFLAGS_NODIST in CPython's Makefile. This flag[1] hides many of the functions we're using, from PyAST_mod2obj via PyTokenizer_Free to _Py_BinOp. None of those functions are part of the public C API, and the new flag hides them from extension modules like pegen.
The best thing I've come up with is to just remove that flag from CPython's Makefile, then make clean and make altinstall.
A serious consequence of this is that the C code generator will really only be useful once it's been integrated into CPython. I don't want to spend effort on convincing the CPython core devs that we should make all those APIs public.
Here's a patch to CPython's configure file that will preserve the change when the Makefile is rebuilt (but not when configure.in is rebuilt):
diff --git a/configure b/configure
index 44f14c3c2c..1901b6edc7 100755
--- a/configure
+++ b/configure
@@ -7377,10 +7377,10 @@ fi
{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_enable_visibility" >&5
$as_echo "$ac_cv_enable_visibility" >&6; }
- if test $ac_cv_enable_visibility = yes
- then
- CFLAGS_NODIST="$CFLAGS_NODIST -fvisibility=hidden"
- fi
+ ## if test $ac_cv_enable_visibility = yes
+ ## then
+ ## CFLAGS_NODIST="$CFLAGS_NODIST -fvisibility=hidden"
+ ## fi
# if using gcc on alpha, use -mieee to get (near) full IEEE 754
# support. Without this, treatment of subnormals doesn't follow
[1] See http://gcc.gnu.org/wiki/Visibility for more on visibility. It's a GCC 4.0+ feature.
I'm in the process of writing some really simple docs. What I've done so far is write basic docs for all the external helper functions and all the helper structs, "stealing" some of the material from notes.md.
I've also created a file called grammar.py, but so far it only describes the styling I'm using when refactoring the grammar rules.
Do you have any other ideas on what should be included in these early-stage docs?
I'm currently working on getting BinOps, BoolOps and comparisons to work, but I don't really want to refactor the primary rule just yet, which means that if there is something more than just an atom in the expression, CONSTRUCTOR gets called, whose result gets propagated up to the BoolOps. I'm assuming that's why I'm getting a SystemError: unknown boolop found when running simpy_cpython. Any ideas on how we could circumvent this for now?
For some nodes, the context (ctx) needs to be assigned correctly (and in some cases, this may involve a recursive traversal). Currently, we only have a helper expr_ty store_name(Parser *p, expr_ty load_name) to assign Store context to names, but we need a more general solution.
The following cases for an expr_ty need to be handled:
Attribute_kind
Subscript_kind
Starred_kind
Name_kind
List_kind
Tuple_kind
The relevant code in CPython lives in:
https://github.com/python/cpython/blob/master/Python/ast.c#L1124
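Below is a minimal sketch of what such a general helper could look like, loosely modeled on set_context() in CPython's Python/ast.c. The function name set_expr_context and its error-handling convention are assumptions, not existing pegen API.
// Hypothetical sketch (requires Python.h and Python-ast.h); returns 1 on
// success and 0 for nodes that cannot be an assignment target.
static int
set_expr_context(expr_ty e, expr_context_ty ctx)
{
    asdl_seq *elts = NULL;

    switch (e->kind) {
    case Name_kind:
        e->v.Name.ctx = ctx;
        break;
    case Attribute_kind:
        e->v.Attribute.ctx = ctx;
        break;
    case Subscript_kind:
        e->v.Subscript.ctx = ctx;
        break;
    case Starred_kind:
        e->v.Starred.ctx = ctx;
        if (!set_expr_context(e->v.Starred.value, ctx))
            return 0;
        break;
    case List_kind:
        e->v.List.ctx = ctx;
        elts = e->v.List.elts;
        break;
    case Tuple_kind:
        e->v.Tuple.ctx = ctx;
        elts = e->v.Tuple.elts;
        break;
    default:
        return 0;  /* not a valid assignment target */
    }
    /* Recurse into the elements of List and Tuple targets. */
    if (elts != NULL) {
        for (Py_ssize_t i = 0; i < asdl_seq_LEN(elts); i++) {
            if (!set_expr_context((expr_ty)asdl_seq_GET(elts, i), ctx))
                return 0;
        }
    }
    return 1;
}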
In order for SyntaxError to be formatted with a nice display of the line, like this,
File "<unknown>", line 1
1/
^
SyntaxError: invalid syntax
we must set the text attribute to the line containing the error.
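For reference, here is a hedged sketch of how the error could be raised from C using the standard SyntaxError detail layout (msg, (filename, lineno, offset, text)), which is what the traceback machinery uses to print the line and the caret. The helper name is hypothetical.
/* Hypothetical helper; the (msg, (filename, lineno, offset, text)) layout is
   the documented SyntaxError detail tuple. */
static void
raise_syntax_error(const char *msg, const char *filename,
                   int lineno, int col_offset, const char *line_text)
{
    PyObject *args = Py_BuildValue("s(siis)", msg, filename,
                                   lineno, col_offset, line_text);
    if (args == NULL)
        return;  /* an error (e.g. MemoryError) is already set */
    PyErr_SetObject(PyExc_SyntaxError, args);
    Py_DECREF(args);
}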
Up until now I was under the impression that the expressions rule should return either a tuple, if there are multiple expressions, or the expression itself, if there is only one.
But trying to refactor the various atom rules, I saw this:
set: '{' expressions '}'
which means that expressions should return an asdl_seq * of all the parsed expressions?
What is the right approach here?
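One way to reconcile the two uses, sketched below purely as an assumption, is to have expressions always return an asdl_seq * and collapse the sequence where a single expression (or a Tuple node) is needed. The helper name seq_to_expr is hypothetical.
/* Hypothetical helper: a one-element sequence yields the element itself,
   anything longer becomes a Tuple node (3.8+ constructor signature). */
static expr_ty
seq_to_expr(Parser *p, asdl_seq *exprs, int lineno, int col_offset,
            int end_lineno, int end_col_offset)
{
    if (asdl_seq_LEN(exprs) == 1)
        return (expr_ty)asdl_seq_GET(exprs, 0);
    return Tuple(exprs, Load, lineno, col_offset,
                 end_lineno, end_col_offset, p->arena);
}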
The model and the code generators make heavy use of Rhs, but RHS in general means just "right hand side", with no hint about the type. In the model, Rhs is intended to be the right hand side of rules, but the node type of that, after simple optimizations, could be any of the node types [Cut, Named, StringLeaf, Lookahead, PositiveLookahead, NegativeLookahead, Choice, Alt, Seq, Opt, Repeat0, Repeat1, Group].
It's best to drop Rhs and stick to the node types that describe types of expressions. That way the visitors, especially the code generators, can have specific actions for each type.
This goes hand in hand with removing synthetic identifiers for expressions in rules, because they complicate the code generators to benefit only toy parsers. Most subexpressions on the right hand side of a rule are discardable, and name=exp is enough to explicitly recover the elements relevant to the translation.
Now that #47 is closed and actions for simpy.gram are ready, I think it'd be helpful to gather thoughts on what's coming next and maybe open issues etc. Here are some things I think are essential:
simpy_cpython passing with TESTFLAGS=-stt.
pgen.
Have I missed something important?
I just discovered that make test prints the output as
"Hello ""world"
Using bisection I found that it (correctly) printed
Hello world
before 3be531c (#133). That PR refactored string parsing thoroughly. I presume something went wrong. @pablogsal Can you look into this?
There's a complete Python grammar in data/simpy.gram, but it is lacking actions.
OTOH there are actions (written in C) for a very small subset in data/cprog.gram -- these are tested by the Makefile targets test, compile, dump and time. The actions construct accurate AST nodes.
We should add actions to simpy.gram modeled after those in cprog.gram.
This will likely require refactoring the grammar in simpy.gram.
As noted in #67, the construct X (sep X)* is heavily used by the current grammar, at least 16 times according to a quick search in my editor. It would be a good idea to add a custom rule to handle this construct, so that we don't have to use seq_insert_in_front every time, as it creates a brand new asdl_seq*. What would the best design be for such a rule?
I got to implementing the actions for while and for and found out that, upon parsing a NAME, name_token gets called, which always uses Load. Except for the quick workaround
target: a=NAME { _Py_Name(a->v.Name.id, Store, EXTRA_EXPR(a, a)) }
there is unfortunately no other way to use Store for a NAME.
What do you think is the best way to implement this?
I have been thinking a bit about error handling in rule actions and action helpers, and I propose either allowing the c_generator to automatically create functions from whatever you include between the { } (harder), or forcing ourselves to only call functions defined in the pegen.c file or another compilation unit (easier). That way, we can do proper error handling and teach the C generator to propagate errors. This becomes more and more important as we implement more rules, because failures in the helpers usually only show up much later, as corrupted objects in deallocation procedures or similar.
What do you think?
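To illustrate the "helpers as real functions" option, here is a hedged sketch; the name make_string_constant is an assumption, not existing pegen code. The helper does the fallible C-API work and returns NULL on failure with the Python error already set, so the generated rule code only has to check the result and bail out.
/* Hypothetical helper living in pegen.c (or another compilation unit). */
static expr_ty
make_string_constant(Parser *p, const char *s, int lineno, int col_offset,
                     int end_lineno, int end_col_offset)
{
    PyObject *str = PyUnicode_FromString(s);
    if (str == NULL)
        return NULL;                      /* error already set, propagate */
    if (PyArena_AddPyObject(p->arena, str) < 0) {
        Py_DECREF(str);
        return NULL;
    }
    /* Constant() is the 3.8+ constructor with a `kind` slot. */
    return Constant(str, NULL, lineno, col_offset,
                    end_lineno, end_col_offset, p->arena);
}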
There are a bunch of places where we use (probably) too many void* types, and this stops gcc from reporting multiple errors due to passing incorrect AST types from one side to the other. Many of the errors that could actually be caught are emitted as warnings.
In some cases, we know the return type of the function, so we know that anything assigned to the resulting value (in many cases named res) must have the same type.
Another possibility, for the time being, is to transform some of the warnings into errors via compiler flags.
The reason I am raising this issue is that debugging these cases almost always means dealing with segmentation faults that originate from CPython when it tries to make use of some corrupted nodes.
I am currently trying to refactor simpy.gram to correctly generate the AST for import statements, which once again proved to be a bit more difficult than expected. I have two main questions:
First, there is the rule
dotted_as_name: dotted_name ['as' NAME]
which, the way I see it, could have two possible return types, either expr_ty or alias_ty, depending on whether an alias is defined or not. Is this correct, or do we return alias_ty in both cases? If so, how do we handle two possible return types? Maybe define void * as the return type?
The second question has to do with the rule:
dotted_name: NAME ('.' NAME)*
Do we need to write something like ast_for_dotted_name to handle that, or is there an easier way that I don't see?
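For the first question, note that the AST's alias node has an optional asname, so returning alias_ty in both cases (with asname left NULL when there is no 'as' clause) is at least possible. For the second question, one option, sketched below purely as an assumption (not existing pegen code), is an ast_for_dotted_name-style helper that joins the ids of the parsed Name nodes with "." into a single identifier owned by the arena; this assumes dotted_name returns an asdl_seq * of Name expr_ty nodes.
/* Hypothetical helper: join the ids of a sequence of Name nodes with ".". */
static PyObject *
join_dotted_name(Parser *p, asdl_seq *names)
{
    PyObject *result = NULL;
    PyObject *dot = PyUnicode_FromString(".");
    PyObject *parts = PyList_New(0);
    if (dot == NULL || parts == NULL)
        goto done;

    for (Py_ssize_t i = 0; i < asdl_seq_LEN(names); i++) {
        expr_ty name = (expr_ty)asdl_seq_GET(names, i);
        if (PyList_Append(parts, name->v.Name.id) < 0)
            goto done;
    }
    result = PyUnicode_Join(dot, parts);
    if (result != NULL && PyArena_AddPyObject(p->arena, result) < 0) {
        Py_DECREF(result);
        result = NULL;
    }
done:
    Py_XDECREF(dot);
    Py_XDECREF(parts);
    return result;
}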
According to the latest run, it takes 80 seconds to get a clean checkout of the cpython repo. That holds up the build by about that much: the total build time was 137 seconds, the next longest was 57 seconds.
Maybe we should just check in a tarball of cpython/Lib and untar that? It should be much quicker.
Items in an alternative may be named, and the names may be referenced in actions. But there are some "forbidden" names. E.g. don't name an item p
, because (when generating C code) every rule parsing function has an argument named p
. There are other possible name clashes too: res
and mark
are always local variables. And there are many helper functions, with names like is_memoized
or singleton_seq
. And of course anything that's a C reserved word (e.g. if
) cannot be used either. Also there are systematic generated names like *_var
, *_type
and *_rule
.
It's easy to rename p
, mark
and res
in the generated code to start with an underscore (by convention, rule names don't start with _
, and maybe we should make this a hard requirement). I'm not sure we need to worry about the others, though we may have to warn about them in the docs.
When generating Python code there are other possible clashes, e.g. self
, mark
and cut
. We can handle these the same way. (There are a few others that seem less important, like ast
and of course Python builtins and keywords.)
(From @apalala)
The characteristics of performance problems in a grammar are too many failed options/rules, and failed options with tracebacks that are very deep.
Applying the "cut" expression should take care of most problems, because we know Python remains LL(small-K).
For later, option reordering is a common way to optimize. Options that produce the most "hits" should come first in rules and groups.
The above is a combination of failing early (cut), and succeeding early.
Hello,
I am currently trying to add actions for the pass statement in simpy.gram. That proved a little more difficult than I expected, due to the following problem:
A simple statement can contain more than one statement, thus its return type should be asdl_seq *, which is propagated to the statement rule, whose return type also becomes asdl_seq *.
My question is, how can the statements rule handle such a return type from statement? Going through the generated code, it seems like the loop for statements expects stmt_ty as the return type of statement.
The relevant grammar rules are these:
start[mod_ty]: a=[statements] ENDMARKER { Module(a, NULL, p->arena) }
statements[asdl_seq*]: a=statement+ { a }
statement[asdl_seq*]: compound_stmt | simple_stmt
simple_stmt[asdl_seq*]: a=small_stmt b=further_small_stmt* [';'] NEWLINE { seq_insert_in_front(p, b, a) }
further_small_stmt[stmt_ty]: ';' a=small_stmt { a }
small_stmt[stmt_ty]: ( return_stmt | import_stmt | pass_stmt | raise_stmt | yield_stmt | assert_stmt | del_stmt | global_stmt | nonlocal_stmt
| assignment | expressions
)
pass_stmt[stmt_ty]: a='pass' { _Py_Pass(EXTRA(a, a)) }
Surely, the action for statements is wrong and should be something else, but I can't really figure out what.
Side note: seq_insert_in_front is a function I wrote in pegen.c, which accepts the asdl_seq * generated by the loop in simple_stmt and prepends the first small_stmt.
I'm currently trying to get all the *_stmt rules to work and I have stumbled upon a very weird bug with the global and nonlocal statements.
When I try to parse a very simple statement like global a, b, it succeeds, but the very next statement, whatever it is, fails with a segmentation fault. Even just pressing the Tab key to autocomplete throws a SEGFAULT.
It seems that something is accessed afterwards that was located in the arena block created (and later freed) in run_parser. Could this be a bug in CPython?
==18593== Invalid read of size 8
==18593== at 0x1B3FC3: _PyObject_IsFreed (object.c:448)
==18593== by 0x294C3B: visit_decref (gcmodule.c:379)
==18593== by 0x18C72E: list_traverse (listobject.c:2632)
==18593== by 0x294008: subtract_refs (gcmodule.c:406)
==18593== by 0x295556: collect (gcmodule.c:1054)
==18593== by 0x295EBF: collect_with_callback (gcmodule.c:1240)
==18593== by 0x296173: collect_generations (gcmodule.c:1262)
==18593== by 0x296250: _PyObject_GC_Alloc (gcmodule.c:1977)
==18593== by 0x296C0E: _PyObject_GC_Malloc (gcmodule.c:1987)
==18593== by 0x296CE9: _PyObject_GC_NewVar (gcmodule.c:2016)
==18593== by 0x1C69F6: PyTuple_New (tupleobject.c:118)
==18593== by 0x1DA0B5: mro_implementation (typeobject.c:1751)
==18593== by 0x1DAE61: type_mro_impl (typeobject.c:1818)
==18593== by 0x1DAED3: type_mro (typeobject.c.h:76)
==18593== by 0x32B3B4: method_vectorcall_NOARGS (descrobject.c:393)
==18593== by 0x1CBEC4: _PyObject_Vectorcall (abstract.h:127)
==18593== by 0x1CBEC4: _PyObject_FastCall (abstract.h:147)
==18593== by 0x1CBEC4: call_unbound_noarg (typeobject.c:1460)
==18593== by 0x1DA341: mro_invoke (typeobject.c:1888)
==18593== by 0x1DA489: mro_internal (typeobject.c:1943)
==18593== by 0x1D1EED: PyType_Ready (typeobject.c:5337)
==18593== by 0x1D9D50: type_new (typeobject.c:2811)
==18593== by 0x1D0D4D: type_call (typeobject.c:969)
==18593== by 0x1782CA: _PyObject_MakeTpCall (call.c:159)
==18593== by 0x17A0C6: _PyObject_FastCallDict (call.c:91)
==18593== by 0x36687F: builtin___build_class__ (bltinmodule.c:231)
==18593== by 0x1B1511: cfunction_vectorcall_FASTCALL_KEYWORDS (methodobject.c:436)
==18593== Address 0x538dc08 is 680 bytes inside a block of size 8,248 free'd
==18593== at 0x483BA3F: free (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==18593== by 0x1B8290: _PyMem_RawFree (obmalloc.c:127)
==18593== by 0x1B8978: _PyMem_DebugRawFree (obmalloc.c:2201)
==18593== by 0x1B89A0: _PyMem_DebugFree (obmalloc.c:2331)
==18593== by 0x1B982A: PyMem_Free (obmalloc.c:629)
==18593== by 0x26B9B1: block_free (pyarena.c:95)
==18593== by 0x26BB39: PyArena_Free (pyarena.c:169)
==18593== by 0x54B5BF4: run_parser (pegen.c:388)
==18593== by 0x54B5D34: run_parser_from_string (pegen.c:436)
==18593== by 0x54B670D: parse_string (parse.c:9510)
==18593== by 0x1776C6: cfunction_call_varargs (call.c:757)
==18593== by 0x17A83B: PyCFunction_Call (call.c:772)
==18593== by 0x1782CA: _PyObject_MakeTpCall (call.c:159)
==18593== by 0x236DF9: _PyObject_Vectorcall (abstract.h:125)
==18593== by 0x236DF9: call_function (ceval.c:4987)
==18593== by 0x236DF9: _PyEval_EvalFrameDefault (ceval.c:3469)
==18593== by 0x22A84F: PyEval_EvalFrameEx (ceval.c:741)
==18593== by 0x22B3B9: _PyEval_EvalCodeWithName (ceval.c:4298)
==18593== by 0x22B53A: PyEval_EvalCodeEx (ceval.c:4327)
==18593== by 0x22B56C: PyEval_EvalCode (ceval.c:718)
==18593== by 0x27330C: run_eval_code_obj (pythonrun.c:1117)
==18593== by 0x273731: run_mod (pythonrun.c:1139)
==18593== by 0x27628E: PyRun_InteractiveOneObjectEx (pythonrun.c:259)
==18593== by 0x27664C: PyRun_InteractiveLoopFlags (pythonrun.c:121)
==18593== by 0x276D22: PyRun_AnyFileExFlags (pythonrun.c:80)
==18593== by 0x167564: pymain_run_stdin (main.c:479)
==18593== by 0x168171: pymain_run_python (main.c:568)
==18593== Block was alloc'd at
==18593== at 0x483A7F3: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==18593== by 0x1B8318: _PyMem_RawMalloc (obmalloc.c:99)
==18593== by 0x1B88CF: _PyMem_DebugRawAlloc (obmalloc.c:2134)
==18593== by 0x1B890B: _PyMem_DebugRawMalloc (obmalloc.c:2167)
==18593== by 0x1B8930: _PyMem_DebugMalloc (obmalloc.c:2316)
==18593== by 0x1B979A: PyMem_Malloc (obmalloc.c:605)
==18593== by 0x26B979: block_new (pyarena.c:80)
==18593== by 0x26BAA2: PyArena_New (pyarena.c:134)
==18593== by 0x54B5A2A: run_parser (pegen.c:343)
==18593== by 0x54B5D34: run_parser_from_string (pegen.c:436)
==18593== by 0x54B670D: parse_string (parse.c:9510)
==18593== by 0x1776C6: cfunction_call_varargs (call.c:757)
==18593== by 0x17A83B: PyCFunction_Call (call.c:772)
==18593== by 0x1782CA: _PyObject_MakeTpCall (call.c:159)
==18593== by 0x236DF9: _PyObject_Vectorcall (abstract.h:125)
==18593== by 0x236DF9: call_function (ceval.c:4987)
==18593== by 0x236DF9: _PyEval_EvalFrameDefault (ceval.c:3469)
==18593== by 0x22A84F: PyEval_EvalFrameEx (ceval.c:741)
==18593== by 0x22B3B9: _PyEval_EvalCodeWithName (ceval.c:4298)
==18593== by 0x22B53A: PyEval_EvalCodeEx (ceval.c:4327)
==18593== by 0x22B56C: PyEval_EvalCode (ceval.c:718)
==18593== by 0x27330C: run_eval_code_obj (pythonrun.c:1117)
==18593== by 0x273731: run_mod (pythonrun.c:1139)
==18593== by 0x27628E: PyRun_InteractiveOneObjectEx (pythonrun.c:259)
==18593== by 0x27664C: PyRun_InteractiveLoopFlags (pythonrun.c:121)
==18593== by 0x276D22: PyRun_AnyFileExFlags (pythonrun.c:80)
==18593== by 0x167564: pymain_run_stdin (main.c:479)
==18593==
Modules/gcmodule.c:379: visit_decref: Assertion "!_PyObject_IsFreed(op)" failed
A discussion about comparing ASTs on SO:
https://stackoverflow.com/a/59144406/545637
This was left out of #56.
There is a suspicion that the generated code (or the support code) for mutually left-recursive rules is broken in C. It would be good to at least test this using the same tests as used to test this for the Python generator.
Those tests are at https://github.com/gvanrossum/pegen/blob/master/test/test_pegen.py#L376-L444.
The 'cut' operator serves two purposes in PEG (and other types of grammars): it avoids useless backtracking once the parser has committed to an alternative, and it lets the parser fail fast on invalid input.
The above is important because a parser must deal with valid programs, and also with invalid ones. The cut operator is the best way to make the parser commit to failure fast (an intentionally ambiguous pseudo-Python source could make the parser backtrack a lot).
For future reference.
As proposed by @gvanrossum in #174.
I just found out that make simpy_cpython fails on my Ubuntu machine, because the Python version I'm using was built from source --with-pydebug, and this assertion fails in Objects/typeobject.c.
I think that the file and line causing the failure is Lib/multiprocessing/pool.py, line 517.
It seems that this is somehow related to a = [*b()] currently failing to parse with a ValueError: field value is required for Starred, which I cannot figure out the reason for. Any ideas?
It seems that due to https://github.com/gvanrossum/pegen/blob/c97dc21f674fa3e7a5fd78bd1df9708fde65fc5c/pegen/c_generator.py#L336 updating the same dict for all alternatives, we cannot have the same variable name with different types in different alternatives of a rule. Is that intended?
I discovered it while trying to refactor the rules for import_from like so:
import_from[stmt_ty]:
| a='from' b=dot_or_ellipsis* !'import' c=dotted_name 'import' d=import_as_names_from { _Py_ImportFrom(get_identifier_from_expr_ty(c), d, seq_count_dots(b), EXTRA(a, d)) }
| a='from' b=dot_or_ellipsis+ 'import' c=import_as_names_from { _Py_ImportFrom(NULL, c, seq_count_dots(b), EXTRA(a, c)) }
It seems reasonable to have the name c for different types of rules in both alts.
When refactoring PEG grammars there may be rewrites that ease code generation but hurt the performance of the parser, because PEG is more algorithmic than denotational or functional.
The original simpy grammar was as performant as pgen, and there are no theoretical reasons why pegen can't remain as performant.
To measure this, it's important to have a way to only parse, hence the request for options that make it easy to bypass code generation in unit tests and system tests. After finding rules that take longer than expected, reordering of clauses and/or lookaheads often returns the parser to its previous performance.
(cc @gvanrossum, as discussed by email)
It would be great if there was a tool that produced documentation-ready grammars from the AST-generating grammar.
Also important is that, as new people come into the project to help with the grammar refactoring towards AST generation, there is one and only one way to correctly format the grammar (as with Python code, and as in CPython C code).
A pretty-printer can solve the above, and would take a short time to write, because there's an OO representation and walkers/visitors for all grammars.
(cc @gvanrossum as discussed by email)
In case the grammar gets published to the official Python docs, it should be as easy as possible for people to understand it.
This issue should act as a reminder to do the necessary refactoring in that direction, when/if the grammar gets uploaded.
It would be good if pegen switched from using fragments of the sketch of the PEG Python grammar in the TatSu examples (which came from an untested ANTLR grammar) to fragments from the one in PyGL (which I've been testing, and which came from Grammar/Grammar, which is authoritative).
@apalala (Since I'm quoting you here. :-)
While working on #100 I realized that with our current C parser, the type of a rule cannot be a simple type that cannot be aliased to a pointer. For example, operator_ty is a simple integer, and the code generated by the C generator is invalid, as integers cannot be assigned NULL and the memoization functions also require void*.
I open this issue to discuss how we should proceed in these cases.
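One possible direction, sketched below purely as an assumption (the helper name is not existing pegen API), is to box the enum value in arena-allocated memory, so that such rules keep returning a pointer and NULL still means "no match".
/* Hypothetical helper: box an operator_ty in the arena so the rule's result
   is a pointer that can be NULL-checked and memoized like any other. */
static operator_ty *
box_operator(Parser *p, operator_ty op)
{
    operator_ty *boxed = PyArena_Malloc(p->arena, sizeof(operator_ty));
    if (boxed == NULL)
        return NULL;
    *boxed = op;
    return boxed;
}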
Suppose we have this rule:
rule: foo [bar [baz]] { Thing(bar, baz) }
This currently doesn't work because the scope of the bar and baz variables is just the smallest enclosing group. We can handle a simpler case just fine:
rule: foo b=[bar] { Thing(b, NULL) }
because the optional rule will return NULL if it isn't present. But extending this pattern to the first example would require returning a tuple, like this:
rule: foo t=[b1=bar b2=[baz] {tuple(b1, b2)}] { t ? Thing(t.b1, t.b2) : Thing(NULL, NULL) }
If we changed the scoping rules so that we could extract values directly from nested groups (as long as they physically appear in the alternative) we could write this directly as
rule: foo [b1=bar [b2=baz]] { Thing(b1, b2) }
Instead of TODO.md we can use GitHub issues.
notes.md is just historical stuff, no longer relevant (and it'll always be in the git history).
Due to tok->filename being NULL, Python/compile.c:336 segfaults when parse_string(..., mode=2) is called. What's the best course of action here?
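A minimal sketch of one possible band-aid, assuming the tokenizer's filename field is a PyObject * as in CPython's Parser/tokenizer.h: give string input a dummy filename before the AST is handed to the compiler. The helper name is hypothetical.
/* Hypothetical helper: make sure tok->filename is set for string input. */
static int
ensure_tok_filename(struct tok_state *tok)
{
    if (tok->filename == NULL) {
        tok->filename = PyUnicode_FromString("<string>");
        if (tok->filename == NULL)
            return -1;
    }
    return 0;
}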
When writing the action for the group rule, I found out that parsing something like (yield a) returns the following AST object:
Module(
body=[
Expr(
value=Yield(
value=Name(
id="a", ctx=Load(), lineno=1, col_offset=7, end_lineno=1, end_col_offset=8
),
lineno=1,
col_offset=1,
end_lineno=1,
end_col_offset=8,
),
lineno=1,
col_offset=0,
end_lineno=1,
end_col_offset=9,
)
],
type_ignores=[],
)
That means that the Expr node's start column offset is 0, although the yield node's is 1. The correct action for group is thus:
group[expr_ty]: '(' a=(yield_expr | full_expression) ')' { a }
but the surrounding Expr node needs to have a different column offset and I can't seem to come up with a way of handling this, so any ideas are welcome!
In the atom rule, the STRING+ alternative still needs to be refactored.
Currently, we don't have a way to correctly propagate errors from grammar action helpers, as these must be written as simple expressions. Solving this is important because these helpers perform many C-API calls that can potentially fail (such as memory allocation, Unicode manipulation, etc.), and without explicitly checking for errors these failures will pass unnoticed.
A parser’s behavior is not fully tested until it is fed with enough bad input. With PEG in particular, some types of incorrect input may make the parser backtrack a lot (strategically placed 'cut' operators help avoid that).
It may be possible to create such a test set automatically, by corrupting some valid programs.
Going through all the things I did for #67 and #71, I realised that there might be a slight problem we didn't catch, but I am not sure. Let me explain with the example of the import_name rule:
import_name[stmt_ty]: a='import' b=dotted_as_names { _Py_Import(b, EXTRA(a, b)) }
b is an asdl_seq * and we use it as the tail parameter to EXTRA, which calls ENDLINE and ENDCOL for b, which then access the end_lineno and end_col_offset attributes after casting the asdl_seq * to an expr_ty. I don't really see how that doesn't generate a SEGFAULT, but (SEGFAULTs aside) I think it will probably be a bug in the future, when these AST nodes are used for error reporting.
Am I maybe missing something?
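For sequences whose elements are expr_ty nodes, a small helper like the hedged sketch below could at least make the intent explicit and let the end-position macros read from a real node instead of relying on the cast (note that alias_ty nodes have no location fields at all, so the import_name case would need its end position from somewhere else). The helper name is an assumption.
/* Hypothetical helper: return the last element of a sequence of expr_ty
   nodes, so the end-position macros operate on a real node. */
static expr_ty
seq_last_expr(asdl_seq *seq)
{
    Py_ssize_t n = asdl_seq_LEN(seq);
    if (n == 0)
        return NULL;
    return (expr_ty)asdl_seq_GET(seq, n - 1);
}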
We should add automated tests (and run them on Travis-CI, see .travis.yml) that verify that data/simpy.gram can parse all of the CPython standard library code.
If there is a CPython checkout at $CPYTHON, then this test would be equivalent to running
make simpy TESTDIR=$CPYTHON/Lib
If getting a CPython checkout in Travis is too slow, we could create a gzipped tarball of just $CPYTHON/Lib and check that into the pegen repo, and as part of the test untar it and then run something like the above make command.
In the old parser, reserved words (e.g. if) are always reserved. We should mimic this behavior for most keywords, otherwise people will start abusing the feature. But there should be a way to turn it off for specific situations, to support e.g. async def and await (and, in the future, other "compound keywords" like match of, if we want to go there).
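A minimal sketch of what enforcing this at the parser level might look like, assuming NAME tokens flow through a single helper where the check can be made; the table, names and the check are illustrative only, and "soft" keywords like async/await would need a per-context switch rather than a global table.
/* Hypothetical reserved-word check (requires <string.h>). */
static const char *reserved_words[] = {
    "if", "else", "for", "while", "def", "class", "return", "pass", NULL,
};

static int
is_reserved_word(const char *s)
{
    for (const char **kw = reserved_words; *kw != NULL; kw++) {
        if (strcmp(s, *kw) == 0)
            return 1;
    }
    return 0;
}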
Unfortunately it appears that some needed functions are not exported by CPython, thus are unavailable on Windows.
pegen.obj : error LNK2001: unresolved external symbol _Py_asdl_seq_new
pegen.obj : error LNK2001: unresolved external symbol PyAST_mod2obj
pegen.obj : error LNK2001: unresolved external symbol PyTokenizer_Free
pegen.obj : error LNK2001: unresolved external symbol PyTokenizer_FromString
pegen.obj : error LNK2001: unresolved external symbol PyTokenizer_Get
pegen.obj : error LNK2001: unresolved external symbol _Py_Name
pegen.obj : error LNK2001: unresolved external symbol PyTokenizer_FromFile
pegen.obj : error LNK2001: unresolved external symbol _Py_Constant
parse.obj : error LNK2001: unresolved external symbol _Py_BinOp
parse.obj : error LNK2001: unresolved external symbol _Py_Module
parse.obj : error LNK2001: unresolved external symbol _Py_Call
parse.obj : error LNK2001: unresolved external symbol _Py_Expr
parse.obj : error LNK2001: unresolved external symbol _Py_If
parse.obj : error LNK2001: unresolved external symbol _Py_Pass
This isn't really a request for support, just mentioning this. Ideally pegen gets upstreamed soon :P