Git Product home page Git Product logo

Comments (13)

whoozle avatar whoozle commented on June 11, 2024

I think it's known problem of pyparsing I'm afraid. Sometimes it could be improved with using - instead of +, but it depends on the grammar and context. QML grammar is very ambiguous, and I'm 95% sure it couldn't be done with LALR(1) parser.
If you think you can propose any alternative parser library/generator which is able to handle QML grammar, I'd love to replace pyparsing, it's extremely slow.

from qmlcore.

tomager avatar tomager commented on June 11, 2024

The Qt QML parser is generated with an internal tool qlalr [2] which is a bison/yacc-style LALR parser generator. The grammar for their QML is at [1], and is nearly LALR. I think the few bits that aren't context free are handled by parsing a superset and then in the reduction case manually disambiguating and emitting different AST nodes.

They parse the entire JS grammar and add Ui* rules to parse the QML specific bits, which is a bit more involved than the pureqml approach but has its advantages (like being able to not require { } around js blocks, at minimum). Esprima [3] is a python parser for JS that could be modified to follow the same approach. Personally, I would think this would be the way to go to justify the trouble of replacing the parser.

Keeping the existing pureqml approach of having free-form JS code blocks captured in some { } blocks, a really basic LALR parser can parse each QML-syntax block that contains properties etc and emit free-form text AST nodes for any { } sub-blocks. Then the lexer/parser would be re-run recursively as the context of the sub-blocks becomes known (js/free-form or qml). There are a bunch of choices for LALR generators for python [4].

I haven't played with pyparsing, but it seems odd to me that a well known parser package wouldn't be able to deal with correct error location, and also be so slow. On my machine it takes like 20+ seconds to parse qmlcore/core/*.qml which is pretty tragic. Makes me think there is something simple that could be changed to fix at least the performance. I'll take a look at it a bit further when I have more time.

[1] https://raw.githubusercontent.com/qt/qtdeclarative/dev/src/qml/parser/qqmljs.g

[2] https://github.com/qt/qtbase/tree/dev/src/tools/qlalr

[3] https://github.com/Kronuz/esprima-python

[4] https://wiki.python.org/moin/LanguageParsing

from qmlcore.

whoozle avatar whoozle commented on June 11, 2024

a really basic LALR parser can parse each QML-syntax block that contains properties etc and emit free-form text AST nodes for any { } sub-blocks.

I have a port to PLY somewhere, and it's not really as easy and straightforward as it seems :(

The most troublesome bits are { } scopes
It could be transform: { x: , y: } - assigning multiple properties, it could be piece of javascript code which is easy to parse using lexer (because of strings and { } count), it could be ListElement { /*json rules here*/ } inside ListModel or component scope (like in Behaviour/on or Component declaration).

pyparsing literally checks every rule (stopping backtracking at -).

On my machine it takes like 20+ seconds to parse qmlcore/core/*.qml which is pretty tragic.

Agree that it's tragic but it should be cached until compiler grammar has changed! :) You can also improve the situation a bit using -j

With regards to javascript parser, we discussed it before, there's both pros and cons of having js parser.
the biggest pros is that we can construct output in different language from AST (C/C++ for embedded systems is one of possible candidates)
the biggest cons are that you have to integrate qml/js parsers together and resolve ambiguity which is not really straightforward if you use normal shift/reduce automaton, and most likely we won't have enough resources to develop and most importantly support that, even if someone did it for us :)

Thanks for thorough analysis though, I really appreciate it, I'll reiterate parser generators once again and see if there's any suitable tool for that. (Or maybe pyparsing could be sped up? packrat does not help much, maybe some bits of the grammar misses -)

from qmlcore.

tomager avatar tomager commented on June 11, 2024

I don't really understand what you are saying about the grammar being very ambiguous. I spent some time this afternoon rewriting most of the parser as a simple recursive-descent structure, and got it to the point of parsing qmlcore/core/*.qml. It only has to do lookahead by a couple of tokens to disambiguate any declaration from any other. It parses core/*.qml in about 2 seconds on my machine, and in about half a second with -j enabled. The line/column/context of any errors are trivially correct with this form of parser. Also the parser logic itself is much easier to understand, since it's just straight forward python logic, rather than a meta language embedded in python like with pyparsing/PLY.

The changeset is at c0e0bc5. The parser is in a new file compiler/parser.py. It's not finished work, but just to prove the parser methodology-- it stops after parsing (it populates the lang.* AST, but I didn't validate that all the arguments are exactly what is expected), doesn't yet support docstrings, and doesn't yet support some of the syntax that isn't used in core/*.qml. It uses pyparsing for json_object and expression, but those could be moved to recursive descent structure easily enough, and all of pyparsing could be removed.

I'm happy to clean it up, finish it, and submit PR if you agree. It would fix this bad-error-message issue, the speed issues, and be easy to validate with your diff method since the output should be identical to the version with pyparsing.

from qmlcore.

whoozle avatar whoozle commented on June 11, 2024

Good start, I guess it's not handling infix expressions too.

I've noticed you renamed property, we can't do it, unfortunately. What you can do is add property to identifier rule and convert it to string explicitly.

from qmlcore.

tomager avatar tomager commented on June 11, 2024

Thanks-- I'll finish it when I have a chance and submit a PR for review.

Yes, infix expressions are still handled by pyparse in the change I linked (it extracts the literal text for the expression and invokes pyparse on just that string). In the existing parsing, infix expressions and scope notation are both collapsed to strings (with ${var} around references in expressions) which is a bit wonky, but will preserve that for now. Ideally references and expression operations (unary/binary/functional/etc) would be first class AST nodes, which would simplify various handling downstream (like scope searching etc).

I will remove the keywords I added (property enum const alias async function signal) and do those by identifier matching, so it matches the existing behavior.

from qmlcore.

whoozle avatar whoozle commented on June 11, 2024

Have you seen https://www.python.org/dev/peps/pep-0617/ ?

They seem to create parser generator for themselves and this doesn't seem like crazy idea to me. There's no decent generator on the internet, even though language parsing is some sort of "solved" problem. :)
In long term this looks better to me than gluing ply/parsing Frankenstein with hand-written recursive descent parser (even though it's a good short-term solution, I don't want to discourage you in any way!)
(But if you think you can create reliable parser using recursive descent + pyparsing bits for infix/code, I'll accept it, we can drop cache and grammar/lang signing all together)

So what I'd like to do is:

  • generic PEG parser generator using Earley parser (maybe later, no Earley is strictly required for the start)
  • language agnostic, generate rule used in per-language runtime, maybe template-based language (jinja?) backends for a start.
  • semantic per-language actions (defined by rule names). Because of language agnosticism it's not a good idea to include them in grammar, but it could be side loading files, e.g. grammar.peg, actions.cpp, actions.py.
  • infix expression support using Shunting Yard algorithm.

If you're interested of participating in such a project, just let me know, this is something I wanted to do for many years. 👍

from qmlcore.

tomager avatar tomager commented on June 11, 2024

Personally, I'm not really interested in working on a new parser generator. I think the recursive descent parser is the way to go-- it's simple, easy to maintain, and complete. In my day job I own a couple fairly complex custom languages, one implemented as recursive descent, and the other implemented with bison. From experience, I can tell you that maintaining the recursive descent parser is much easier since the code "means what it says", whereas touching the bison parser requires swapping in all the subtleties of the meta language and generator. This is a problem that all meta/generator approaches have.

Also from the point of view of attracting pureqml contributors, it will be harder to find people willing/able to hack on the parser and your custom parser generator, rather than an existing one that is widely used, or a simple recursive descent implementation. Python doesn't have that problem because it's a massive project.

Recursive descent is how gcc, clang, and the esprima JS parser I linked to earlier all work. It's not a fringe approach or anything.

There is no particular difficulty in implementing infix expressions or json blocks with recursive descent. Also ply.lex is basically just a small wrapper around re.compile(r'(token1)|(token2)|(token3)|...'), so both pyparsing and ply.lex can be eliminated without much fuss. It took me about 5 hours to get 80% of the parser rewrite going, probably with 2 more days of push it can be done.

Like I said earlier, personally I think if more time were spent on the parser side, integrating the full JS grammar would be the thing to do. That would allow more complete dependency tracking, and we could unify the syntax with Qt QML. That would also allow future upstreaming to Qt, and make it easier for existing Qt QML people (which are pretty numerous at this point) to use pureqml.

from qmlcore.

whoozle avatar whoozle commented on June 11, 2024

Fair enough. :)

Although recursive descent is the most readable/maintainable, it's also the slowest possible implementation.
I think most of the real world parsers, (including gcc's) have more efficient algorithm for infix expressions, like shunting yard, precedence climbing or Pratt's parsing, or you will end up having recursive rule per each precedence level, like it was in old yacc

x1 : x2 | x2
x2 : x3 & x3
x3 : x4 + x4 | x4 - x4
x4: x5 * x5 | x5 / x5

etc etc

Although in pureqml we prolly don't need it at all, as we collapse it back to string 😉

from qmlcore.

tomager avatar tomager commented on June 11, 2024

Yes, I agree. Often the generator approach (at least bison/ply) will be much faster than anything hand-rolled, and in gcc/clang there is quite a lot of custom machinery to get the performance they do.

For the infix expression parsing, since the expression gets immediately lowered back to string, I'm not sure it needs bother with operator precedence at all (or very minimally just for parens and brackets). It just needs to recover the context of various rvalue terms (eg number/enum/function/property/...), in order to do the ${} substitution rewrites required downstream.

from qmlcore.

whoozle avatar whoozle commented on June 11, 2024

I've implemented new RD parser with TDOP bit, parses any project from scratch in 3s or so. You can pass -E to enable it. I tested it on some larger projects, working really good, although not yet ready to be a default.

Error messages are very verbose:

        property array h: [1 ** 2, a, b;
                                       ^-- Expected ',' as an array delimiter

from qmlcore.

tomager avatar tomager commented on June 11, 2024

This is awesome! Having nice error messages with at least correct location is a big deal.

Thanks for doing this work and sorry I didn't have more time. I will enable -E for our project and start using it.

from qmlcore.

whoozle avatar whoozle commented on June 11, 2024

we switched to new parser recently and apparently our current error output is pretty close to ideal, closing it for now, if you have any further questions/suggestions - feel free to open new/reopen this issue. :)

from qmlcore.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.