we-like-parsers / cpython

This project forked from python/cpython

Here we work on integrating pegen into CPython; use branch 'pegen'

Home Page: https://github.com/gvanrossum/pegen

License: Other

cpython's Introduction

This is Python version 3.11.0 alpha 0

Copyright (c) 2001-2021 Python Software Foundation. All rights reserved.

See the end of this file for further copyright and license information.

For more complete instructions on contributing to CPython development, see the Developer Guide.

Installable Python kits, and information about using Python, are available at python.org.

On Unix, Linux, BSD, macOS, and Cygwin:

./configure
make
make test
sudo make install

This will install Python as python3.

You can pass many options to the configure script; run ./configure --help to find out more. On macOS case-insensitive file systems and on Cygwin, the executable is called python.exe; elsewhere it's just python.

Building a complete Python installation requires the use of various additional third-party libraries, depending on your build platform and configure options. Not all standard library modules are buildable or usable on all platforms. Refer to the Install dependencies section of the Developer Guide for current detailed information on dependencies for various Linux distributions and macOS.

On macOS, there are additional configure and build options related to macOS framework and universal builds. Refer to Mac/README.rst.

On Windows, see PCbuild/readme.txt.

If you wish, you can create a subdirectory and invoke configure from there. For example:

mkdir debug
cd debug
../configure --with-pydebug
make
make test

(This will fail if you also built at the top-level directory. You should do a make clean at the top-level first.)

To get an optimized build of Python, configure --enable-optimizations before you run make. This sets the default make targets up to enable Profile Guided Optimization (PGO) and may be used to auto-enable Link Time Optimization (LTO) on some platforms. For more details, see the sections below.

PGO takes advantage of recent versions of the GCC or Clang compilers. If used, either via configure --enable-optimizations or by manually running make profile-opt regardless of configure flags, the optimized build process will perform the following steps:

The entire Python directory is cleaned of temporary files that may have resulted from a previous compilation.

An instrumented version of the interpreter is built, using suitable compiler flags for each flavor. Note that this is just an intermediary step. The binary resulting from this step is not good for real-life workloads as it has profiling instructions embedded inside.

After the instrumented interpreter is built, the Makefile will run a training workload. This is necessary in order to profile the interpreter's execution. Note also that any output, both stdout and stderr, that may appear at this step is suppressed.

The final step is to build the actual interpreter, using the information collected from the instrumented one. The end result will be a Python binary that is optimized and suitable for distribution or production installation.

Link Time Optimization (LTO) is enabled via configure's --with-lto flag. LTO takes advantage of the ability of recent compiler toolchains to optimize across the otherwise arbitrary .o file boundary when building final executables or shared libraries, for additional performance gains.

We have a comprehensive overview of the changes in the What's New in Python 3.10 document. For a more detailed change log, read Misc/NEWS, but a full accounting of changes can only be gleaned from the commit history.

If you want to install multiple versions of Python, see the section below entitled "Installing multiple versions".

Documentation for Python 3.10 is online, updated daily.

It can also be downloaded in many formats for faster access. The documentation is downloadable in HTML, PDF, and reStructuredText formats; the latter version is primarily for documentation authors, translators, and people with special formatting requirements.

For information about building Python's documentation, refer to Doc/README.rst.

Significant backward incompatible changes were made for the release of Python 3.0, which may cause programs written for Python 2 to fail when run with Python 3. For more information about porting your code from Python 2 to Python 3, see the Porting HOWTO.

To test the interpreter, type make test in the top-level directory. The test set produces some output. You can generally ignore the messages about skipped tests due to optional features which can't be imported. If a message is printed about a failed test or a traceback or core dump is produced, something is wrong.

By default, tests are prevented from overusing resources like disk space and memory. To enable these tests, run make testall.

If any tests fail, you can re-run the failing test(s) in verbose mode. For example, if test_os and test_gdb failed, you can run:

make test TESTOPTS="-v test_os test_gdb"

If the failure persists and appears to be a problem with Python rather than your environment, you can file a bug report and include relevant output from that command to show the issue.

See Running & Writing Tests for more on running tests.

On Unix and Mac systems if you intend to install multiple versions of Python using the same installation prefix (--prefix argument to the configure script) you must take care that your primary python executable is not overwritten by the installation of a different version. All files and directories installed using make altinstall contain the major and minor version and can thus live side-by-side. make install also creates ${prefix}/bin/python3 which refers to ${prefix}/bin/pythonX.Y. If you intend to install multiple versions using the same prefix you must decide which version (if any) is your "primary" version. Install that version using make install. Install all other versions using make altinstall.

For example, if you want to install Python 2.7, 3.6, and 3.10 with 3.10 being the primary version, you would execute make install in your 3.10 build directory and make altinstall in the others.

Bug reports are welcome! You can use the issue tracker to report bugs, and/or submit pull requests on GitHub.

You can also follow development discussion on the python-dev mailing list.

If you have a proposal to change Python, you may want to send an email to the comp.lang.python or python-ideas mailing lists for initial feedback. A Python Enhancement Proposal (PEP) may be submitted if your idea gains ground. All current PEPs, as well as guidelines for submitting a new PEP, are listed at python.org/dev/peps/.

See PEP 619 for Python 3.10 release details.

Copyright (c) 2001-2021 Python Software Foundation. All rights reserved.

Copyright (c) 2000 BeOpen.com. All rights reserved.

Copyright (c) 1995-2001 Corporation for National Research Initiatives. All rights reserved.

Copyright (c) 1991-1995 Stichting Mathematisch Centrum. All rights reserved.

See the LICENSE for information on the history of this software, terms & conditions for usage, and a DISCLAIMER OF ALL WARRANTIES.

This Python distribution contains no GNU General Public License (GPL) code, so it may be used in proprietary projects. There are interfaces to some GNU code but these are entirely optional.

All trademarks referenced herein are property of their respective holders.

cpython's People

Contributors

gvanrossum, benjaminp, birkenfeld, freddrake, vstinner, rhettinger, serhiy-storchaka, pitrou, jackjansen, loewis, tim-one, akuchling, brettcannon, bitdancer, warsaw, ezio-melotti, mdickinson, nnorwitz, tiran, terryjreedy, gpshead, orsenthil, vsajip, merwok, jeremyhylton, 1st1, berkerpeksag, ned-deily, gward, zooba

Stargazers

Erlend E. Aasland

Watchers

James Cloos, Rune Hansén Steinnes, Pablo Galindo Salgado, Filipe Laíns 🇵🇸, Lysandros Nikolaou

cpython's Issues

Run both parsers and compare output

On python-dev, Nathaniel just suggested that we could run both the old and the new parser and compare the trees, at least during the alpha/beta period. This would help us trust that the new compiler fulfills its promise. (When selecting the old parser we should not do this, of course. Some people may be testing for startup speed, for example.)
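
A minimal sketch of such a comparison, assuming the experimental _peg_parser module from this branch exposes both parsers via an oldparser flag:

import ast
import _peg_parser  # assumption: the experimental module in this branch

def trees_match(source):
    # Parse with both parsers and compare the dumped ASTs.
    old_tree = _peg_parser.parse_string(source, oldparser=True)
    new_tree = _peg_parser.parse_string(source)
    return ast.dump(old_tree) == ast.dump(new_tree)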

Upstream review

Is it up to people to step up to review our PR or should we explicitly ask for it? Maybe on python-dev?

Collect perf stats

From @apalala:

The new parser should compute its performance stats (even at a performance hit) while it is in beta. A switch would make the parser output those stats.

With that, the parser can issue a Python WARNING for any code with LOC/sec outside of a given threshold (successful or failed parse).

The developer who sees the warning will likely know how to follow through (especially if the warning points to a web page).

Thus we can gather a meaningful set of test cases.

It's an easy and safe way to go about it.

The tricky thing, of course, is knowing what a reasonable speed is for the current computer. Also, what timer should we use to avoid mistaking a suspended process for a slow parse? ^Z is still a thing. :-)
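
One option for a suspension-proof measurement is time.process_time(), which counts CPU time rather than wall-clock time; a rough sketch (parse_func is a stand-in for whichever parser entry point we settle on):

import time

def loc_per_second(source, parse_func):
    # process_time() counts CPU time only, so a suspended (^Z) process
    # does not inflate the measurement.
    loc = source.count("\n") + 1
    start = time.process_time()
    parse_func(source)
    elapsed = time.process_time() - start
    return loc / elapsed if elapsed else float("inf")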

Supporting type_comments=True

The ast.parse() function has two barely documented features, type_comments= and feature_version=, which modify the tokenizer and parser for the benefit of (primarily) mypy. type_comments=True changes the tokenizer to return TYPE_COMMENT tokens when it detects a type comment (a comment starting with 'type:'), and the old Grammar file has [TYPE_COMMENT] sprinkled throughout to support these.

In addition, feature_version=(3, N) enables a few changes in the tokenizer and ast.c to support older Python versions (mostly related to the three different stages of support for async and await keywords). [moved to #124]

We'll have to support these. I think it won't be very difficult since most of the logic is part of the tokenizer, but we'll have to check ast.c for other tests for feature_version. Note that these flags are passed in different forms to compile(), which has the actual (but undocumented) support. See the source of ast.py.
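
For reference, a quick check of both flags (runnable on 3.8+; the sample inputs are made up):

import ast

# type_comments=True attaches the text after "# type:" to the node.
tree = ast.parse("x = []  # type: list[int]", type_comments=True)
print(tree.body[0].type_comment)  # prints: list[int]

# feature_version=(3, N) rejects syntax newer than 3.N, e.g. the walrus.
try:
    ast.parse("(y := 1)", feature_version=(3, 7))
except SyntaxError as exc:
    print(exc)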

Meta-issue: create a plan

Tentatively the plan could just be:

  • Add a non-functional command-line flag that can enable and disable the new parser
  • Add a PYTHON{SOMETHING} env variable to do the same
  • Add an API that invokes the new parser explicitly (and another that invokes the old parser)
  • Make the flag have the desired effect

Should we restrict somehow the nesting?

When investigating the maximum nesting possible with the current grammar, I found this:

[Figure 1: parse time vs. nesting depth; the timings grow exponentially]

where test_case is this function:

def test_case(n):
    # Parse "1+1" wrapped in n levels of parentheses with the pegen parser.
    parser.parse_string("(" * n + "1+1" + ")" * n)

This means that the parser will appear to hang once the nesting goes much beyond 25, as the timings grow exponentially.
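
For reproduction, a hedged timing sketch (reusing the same parser module as above):

import time

def time_nesting(n):
    source = "(" * n + "1+1" + ")" * n
    start = time.perf_counter()
    parser.parse_string(source)  # assumption: `parser` as in test_case above
    return time.perf_counter() - start

for n in range(16, 26, 2):
    print(n, round(time_nesting(n), 3))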

What do you think we should do about this?

Add a use_peg field to PyCompilerFlags

Would it be too much to add a cf_use_peg_parser field to PyCompilerFlags? I think it'd make things simpler for #93 and we wouldn't need the not so pretty _PyInterpreterState_GET()->config assignment in pythonrun.c.

Fix Travis for upstream CPython

In upstream CPython, the Travis configuration uses Python 3.5 (for real) when running regeneration steps (PYTHON_FOR_REGEN), and that fails due to f-strings and other stuff in our generator. We need to upgrade the Travis file to use something more modern so the CI does not fail when merging upstream.

We should also restore all CI files that are used upstream so we know we are not missing anything.

We will probably need a PEP

Especially now that PyCon has been cancelled I expect that the best way to present our case to the core developers is through a PEP.

SyntaxError "unmatched ')'" if nesting too deep

Example:

import peg_parser
n=201
peg_parser.parse_string(n*'(' + ')'*n)

If n is 200 or less, this produces no output; for 201 or higher (up to 218) you get:

SyntaxError: unmatched ')'

No error when parsing "f(**kwargs, *args)"

We are not throwing any error when parsing f(**kwargs, *args) while the old parser raises:

>>> f(**kwargs, *args)
  File "<stdin>", line 1
SyntaxError: iterable argument unpacking follows keyword argument unpacking

How to go about including pegen code in CPython?

How would you go about moving all the pegen code into cpython? What I have done so far is the following.

  1. Copy the pegen directory into cpython/Parser (and remove all the stuff that's not needed).
  2. Move pegen.c and parse_string.c from cpython/Parser/pegen into cpython/Parser.
  3. Move pegen.h and parse_string.h from cpython/Parser/pegen into cpython/Include.
  4. Generate two new functions that call run_parser_* and can be called from C.
  5. Add a parse.h in cpython/Include that exports both the new generated functions.
  6. Generate the parser into Parser/parse.c.

All the names are just dummy names that I didn't give much thought to, but do you think something like this would work?

IndentationErrors are not reported with pegen

❯ ./python.exe -X oldparser
Python 3.9.0a5+ (heads/pegen-dirty:04062ed231, Apr 20 2020, 23:13:24)
[Clang 11.0.0 (clang-1100.0.33.17)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> def f():
...    ...
...      ...
  File "<stdin>", line 3
    ...
    ^
IndentationError: unexpected indent
>>> ^D
❯ ./python.exe
Python 3.9.0a5+ (heads/pegen-dirty:04062ed231, Apr 20 2020, 23:13:24)
[Clang 11.0.0 (clang-1100.0.33.17)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> def f():
...    ...
...      ...
  File "<stdin>", line 3
    ...
    ^
SyntaxError: invalid syntax

Failure in test_fstring: SyntaxError messages are different

There are two fstring-related types of SyntaxErrors, where pegen produces a different kind of error message:

  1. Invalid string prefixes: For invalid string prefixes we output invalid syntax, while the old parser outputs unexpected EOF while parsing, which is arguably wrong.

  2. For lambdas inside fstring expressions it's exactly the same. Pegen outputs invalid syntax, while the old parser prints unexpected EOF while parsing, which again is arguably wrong.

For some reason, I can't reproduce these error messages in an interactive session, but the tests pass with the old parser and fail with pegen, both on my system and on GitHub Actions.

Throw SyntaxError instead of ValueError or UnicodeDecodeError

Current parser:

╰─ ./python       
Python 3.9.0a5+ (heads/errors:df51e29ce4, Apr  5 2020, 21:29:10) 
[GCC 9.2.1 20191008] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> eval(r""" b'\x0' """)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<string>", line 1
SyntaxError: (value error) invalid \x escape at position 0
>>> eval(r"f'\N'")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<string>", line 1
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-1: malformed \N character escape

Pegen:

╰─ ./python -p new                                 
Python 3.9.0a5+ (heads/errors:df51e29ce4, Apr  5 2020, 21:29:10) 
[GCC 9.2.1 20191008] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> eval(r""" b'\x0' """)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: invalid \x escape at position 0
>>> eval(r"f'\N'")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'unicodeescape' codec can't decode bytes in position 0-1: malformed \N character escape

Do we want to do that?

/usr/bin/time does not have a -l option

Running make time in Tools/peg_generator on my Linux machine fails because /usr/bin/time does not have a -l option. After a quick search, I couldn't find out what that option is supposed to do, so I wasn't able to fix it. (-l appears to be the BSD/macOS flag for printing resource usage.)

Failure in test_string_literals: Unexpected amount of warnings

Current Parser:

╰─ ./python       
Python 3.9.0a5+ (heads/pegen:99a8e2fa08, Apr  9 2020, 16:19:04) 
[GCC 9.2.1 20191008] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import warnings
>>> with warnings.catch_warnings(record=True) as w:
...     warnings.simplefilter('always', category=DeprecationWarning)
...     eval("'''\n\\z'''")
... 
'\n\\z'
>>> len(w)
1

Pegen:

╰─ ./python -p new
Python 3.9.0a5+ (heads/pegen:99a8e2fa08, Apr  9 2020, 16:19:04) 
[GCC 9.2.1 20191008] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import warnings
>>> with warnings.catch_warnings(record=True) as w:
...     warnings.simplefilter('always', category=DeprecationWarning)
...     eval("'''\n\\z'''")
... 
'\n\\z'
>>> len(w)
2

`SyntaxError: Keyword argument repeated` does not get thrown by pegen

When, for example, parsing f(1, x=2, *(3, 4), x=5), the behaviour of the current parser is this:

>>> t = 'f(1, x=2, *(3, 4), x=5)'
>>> ast.parse(t)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/lysnikolaou/Repositories/pegen-cpython/Lib/ast.py", line 50, in parse
    return compile(source, filename, mode, flags,
  File "<unknown>", line 1
SyntaxError: keyword argument repeated

Pegen currently accepts this.

I don't think this should be a SyntaxError at all, but it is, and there is a test for it in test_grammar.py.

Multiline single statement is allowed

_peg_parser.parse_string('hello\nworld', mode='single') currently succeeds with pegen. The current parser checks for multiline single statements in parsetok.c. We probably need to somehow do so as well.
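
For comparison, the current parser's behaviour is observable through compile() in 'single' mode:

# Expected rejection of a multiline input in 'single' mode:
try:
    compile("hello\nworld", "<test>", "single")
except SyntaxError as exc:
    print(exc)  # multiple statements found while compiling a single statement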

New decorator syntax

PEP 614 relaxed the grammar on decorators -- you can now use any expression (even walrus) after the @ sign. It should be a simple fix.
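
A hedged illustration of the relaxed grammar (the names here are made up; this is a SyntaxError before Python 3.9):

registry = {"trace": lambda func: func}

@registry["trace"]  # an arbitrary expression after "@" is now allowed
def handler():
    pass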

test_peg_parser should always use both parsers

Now that we use pegen by default, test_peg_parser is not using the old parser to compare the produced AST trees (because ast.parse also uses the new parser). We need to make sure that the test always compares both parsers against each other (maybe exposing the old parser also in the new _peg_parser module).

Set up CI so that PRs in this repo actually run CI

And merges into the 'pegen' branch should also trigger CI.

Only a basic test run is required -- probably just the standard test suite on macOS, Windows and Linux?

FWIW I have no idea how to do this.

Failure in test_traceback: Different column pointer upon parsing error

Current Parser:

╰─ ./python
Python 3.9.0a5+ (heads/pegen:99a8e2fa08, Apr  9 2020, 16:19:04) 
[GCC 9.2.1 20191008] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> x = 5 | 4 |
  File "<stdin>", line 1
    x = 5 | 4 |
              ^
SyntaxError: invalid syntax

Pegen:

╰─ ./python -p new
Python 3.9.0a5+ (heads/pegen:99a8e2fa08, Apr  9 2020, 16:19:04) 
[GCC 9.2.1 20191008] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> x = 5 | 4 |
  File "<stdin>", line 1
    x = 5 | 4 |
               ^
SyntaxError: invalid syntax

The column pointer differs by one. Would it be okay to just skip this test if pegen is enabled?

Do we need the peg_parser module?

If the new parser is going to be the default (and therefore ast.parse and compile will use it), do we really need the peg_parser module?

I am asking because I was going to make it compile on Windows, but if we are not going to keep it around, it does not make sense to make the effort.

Property-based testing for the parser

Hi everyone!

I gave a talk about property-based testing at the Language Summit yesterday, and people seemed interested. Since it's particularly useful when you have two implementations that can be compared, I thought I should drop by and offer to help apply it to testing the new parser 🙂

We've tested against the stdlib and many popular packages - why might this find additional bugs?

Popular and well-written code may have zero examples of many weird syntactic constructs, since it's typically designed for clarity and compatibility. In my demo here, I discovered several strings that can be compiled but not tokenize.tokenized - involving odd use of newlines, backslashes, non-ascii characters, etc.

It's also nice to automatically get minimal examples of any failing inputs.

Say we decide to use this. What's the catch?

I'll assume here that you're familiar with the basic idea of Hypothesis (e.g. saw my talk) - if not, the example above should be helpful and I've collected all the links you could want here: https://github.com/Zac-HD/stdlib-property-tests#further-reading

I've already written a tool to generate Python source code from 3.8's grammar, called hypothesmith - it's a standard strategy that generates strings which can be passed to the compile builtin (as a convenient predicate for "is this actually valid source code").

hypothesmith is implemented from a reasonably good EBNF grammar, "inverting" it using Hypothesis's built-in support for such cases (and the st.from_regex() strategy for terminals). The catch is that, for exactly the same reasons as PEP 617 exists, I have a bunch of hacks where I filter substrings using the compile builtin - and so there's no way to generate anything not accepted by that function. Pragmatically, I think it's still worth using, since it's pretty quick to set up!

Extensions after vanilla hypothesmith is running against the parser

First, updating hypothesmith to treat the PEG grammar as authoritative is a no-brainer - I'm aiming to finish that in the next week or two. This would remove the "compile is trusted" limitation entirely. It also shouldn't be too hard - strategies are literally parser combinators, and to generate a string we just "run it backwards": start at the root, and choose random productions at each branch or generate random terminals, writing out the string that would drive such transitions.

Second, Hypothesis tests can be used as fuzz targets for coverage-guided fuzzers such as AFL or libfuzzer. Since we can also generate a wide variety of inputs from scratch, this can be a really powerful technique for ratcheting up our coverage of really weird edge cases. Or just hypothesis.target(...) things like the length of the generated code, the number of ast nodes, number of node types, nodes-per-length, etc.

Please be very precise about what you suggest

  1. Install Hypothesis and Hypothesmith
  2. Use @hypothesis.given(hypothesmith.from_grammar()) to generate inputs to a copy of your existing tests which run against stdlib / pypi / other python modules (see the sketch after this list).
  3. Fix any bugs you find, tell me what worked and/or what didn't.

  4. Something I didn't think of.
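
A minimal sketch of step 2, assuming hypothesis and hypothesmith are installed and that _peg_parser exposes both parsers via an oldparser flag:

import ast
from hypothesis import given, settings
import hypothesmith  # assumption: pip install hypothesmith

@given(hypothesmith.from_grammar())
@settings(max_examples=200, deadline=None)
def test_parsers_agree(source):
    import _peg_parser  # assumption: the module from this branch
    old_tree = _peg_parser.parse_string(source, oldparser=True)
    new_tree = _peg_parser.parse_string(source)
    assert ast.dump(old_tree) == ast.dump(new_tree)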

You probably have more questions, and I hope I haven't been too presumptuous opening this issue. Please ask, and I'm happy to answer and help out (including reviewing and/or writing code) however I can, subject to other commitments and my Australian timezone!

Finally, thanks for all your work on this project - I'm really looking forward to what a precise and complete grammar will do for the core and wider Python community.

Failure in test_compile: Unexpected LookupError

Current Parser:

╰─ ./python       
Python 3.9.0a5+ (heads/pegen:99a8e2fa08, Apr  9 2020, 16:19:04) 
[GCC 9.2.1 20191008] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> code = '# -*- coding: badencoding -*-\n"\xc2\xa4"\n'
>>> eval(code)
'¤'

Pegen:

╰─ ./python -p new
Python 3.9.0a5+ (heads/pegen:99a8e2fa08, Apr  9 2020, 16:19:04) 
[GCC 9.2.1 20191008] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> code = '# -*- coding: badencoding -*-\n"\xc2\xa4"\n'
>>> eval(code)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
LookupError: unknown encoding: badencoding

Symtable file name is not preserved

From the test suite after #96 is applied:

======================================================================
FAIL: test_filename_correct (test.test_symtable.SymtableTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/pgalindo3/github/python/pegen/Lib/test/test_symtable.py", line 194, in checkfilename
    symtable.symtable(brokencode, "spam", "exec")
  File "/Users/pgalindo3/github/python/pegen/Lib/symtable.py", line 13, in symtable
    top = _symtable.symtable(code, filename, compile_type)
  File "<string>", line 1
    def f(x): foo)(
                 ^
SyntaxError: unmatched ')'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/pgalindo3/github/python/pegen/Lib/test/test_symtable.py", line 201, in test_filename_correct
    checkfilename("def f(x): foo)(", 14)  # parse-time
  File "/Users/pgalindo3/github/python/pegen/Lib/test/test_symtable.py", line 196, in checkfilename
    self.assertEqual(e.filename, "spam")
AssertionError: '<string>' != 'spam'
- <string>
+ spam
