dervall / piglet
An easier to use parsing and lexing tool that is configurable in fluent code without a pre-build step.
Home Page: binarysculpting.com
License: MIT License
Give the user an option to return default(T) instead of throwing exceptions when parsing fails. Suitable for applications like syntax highlighting, where you want the parser to be fast and to ignore bad input.
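The idea can be sketched as a thin wrapper; a minimal Python illustration, where the name parse_or_default and the choice of ValueError as the failure signal are assumptions for the sketch, not Piglet API:

```python
def parse_or_default(parser, text, default=None):
    # Return `default` instead of raising when parsing fails --
    # handy for syntax highlighting, where bad input is ignored.
    try:
        return parser(text)
    except ValueError:
        return default

ok = parse_or_default(int, "42")
bad = parse_or_default(int, "not a number")
```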
While Piglet will NOT support capture groups (simply because there is no reason to in a lexer), it would be a good idea not to choke on the very common (?: syntax for making a non-capturing group. In essence, we want to ignore this syntax, since the behaviour it gives is exactly the same as our default behaviour.
There might also be more of these syntax tricks that we can make the regex engine "compatible" with so that it becomes easier to copy-paste expressions from the web.
It would cover a nice niche use case if we could have a function that takes a number of regular expressions and returns true for each one that matches (or calls an appropriate callback). This is a case where you would normally use multiple regular expressions and test each of them in turn, and it is a subset of normal lexer operation.
In essence:
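A language-agnostic sketch of such a function in Python; the name match_any and the (pattern, callback) handler shape are assumptions for illustration:

```python
import re

def match_any(text, handlers):
    # Try each (pattern, callback) pair against the whole text and
    # invoke the callback for every pattern that matches.
    matched = []
    for pattern, callback in handlers:
        if re.fullmatch(pattern, text):
            callback(text)
            matched.append(pattern)
    return matched

hits = []
match_any("123", [
    (r"\d+", lambda s: hits.append("integer")),
    (r"[a-z]+", lambda s: hits.append("word")),
    (r"\d+(\.\d+)?", lambda s: hits.append("number")),
])
```

Unlike normal lexing, which picks a single winning token, every matching expression fires.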
In an effort to learn Piglet (which is excellent, by the way - nice work!), I'm trying to construct a parser based on the Turtle sample. I've come to a point where I want to start declaring and parsing variables, but if I try and work with anything that isn't a quoted string, it fails.
I've included the source that I'm working with below.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using Piglet.Parser;

namespace Piglet2
{
    class Car
    {
        public string Name { get; set; }

        public void Move(double distance)
        {
            Console.WriteLine("Moving {0}", distance);
        }

        public void Rotate(double angle)
        {
            Console.WriteLine("Rotating {0}", angle);
        }

        public void EngineOff()
        {
            Console.WriteLine("Turning engine off");
        }
    }

    class Program
    {
        static void Main(string[] args)
        {
            var variables = new Dictionary<string, object>();
            var input = @"
car NewCar
var somevar = ""My car""
var x = 10
rotate 20
move 50.2
rotate 20.7
engineoff";
            var car = new Car();

            var config = ParserFactory.Fluent();
            var program = config.Rule();
            var nameDecl = config.Rule();
            var statements = config.Rule();
            var statement = config.Rule();
            var variableDecl = config.Rule();
            var value = config.Rule();

            var symbol = config.Expression();
            symbol.ThatMatches(@"([a-zA-Z]+)").AndReturns(f => f);

            program.IsMadeUp.By(nameDecl).Followed.By(statements);
            statements.IsMadeUp.ByListOf(statement);

            statement.IsMadeUp
                .By(variableDecl)
                .Or.By("engineoff").WhenFound(f =>
                {
                    car.EngineOff();
                    return null;
                })
                .Or.By("move").Followed.By<double>().As("Distance").WhenFound(f =>
                {
                    car.Move(f.Distance);
                    return null;
                })
                .Or.By("rotate").Followed.By<double>().As("Distance").WhenFound(f =>
                {
                    car.Rotate(f.Distance);
                    return null;
                });

            variableDecl.IsMadeUp.By("var")
                .Followed.By(symbol).As("Name")
                .Followed.By("=")
                .Followed.By(value).As("Value")
                .WhenFound(f =>
                {
                    variables[f.Name] = f.Value;
                    return null;
                });

            nameDecl.IsMadeUp.By(@"car")
                .Followed.By(symbol).As("ProgramName")
                .WhenFound(f => car.Name = f.ProgramName);

            value.IsMadeUp
                .By(config.QuotedString)
                .Or.By<int>();

            var parser = config.CreateParser();

            try
            {
                parser.Parse(input);
            }
            catch (ParseException ex)
            {
                throw;
            }

            Console.WriteLine("Parsed program {0}", car.Name);
            Console.WriteLine("Variables found:");
            foreach (var item in variables)
            {
                Console.WriteLine("{0} = {1}", item.Key, item.Value);
            }
            Console.ReadKey();
        }
    }
}
As you can see, I've specified that a 'value' is made up by a quoted string or an integer. The crux of the issue is that if I include the line "var x = 10" in my input, the parser fails with this exception:
Illegal token \d+(\.\d+)?. Expected "(\\.|[^"])*",\d+
If I take it out, the input gets parsed successfully. I've tried this with booleans (both using .By() and also by specifying a custom boolean expression) and that fails too.
Thanks for taking a look - just wondering if I'm missing something obvious!
Since the project has an icon now, it should be in the NuGet package file. How one does that, I really don't have a clue.
Piglet builds with a lot of warnings since there are no XML doc comments on every public member. This is not acceptable and must be fixed!
\w has unexpected behaviour in that it only matches the English non-accented letters. The standard MS implementation, which is what people expect, will match all letters. This should be changed.
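The difference is easy to demonstrate against another Unicode-aware engine; here a Python sketch, where the ASCII-only flag stands in for Piglet's current behaviour:

```python
import re

# Unicode-aware \w (the behaviour .NET users expect) accepts
# accented letters; an ASCII-only \w (Piglet's current behaviour)
# rejects them.
unicode_match = re.fullmatch(r"\w+", "naïve")
ascii_match = re.fullmatch(r"\w+", "naïve", re.ASCII)
```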
NFA lexer construction is incredibly fast, but the execution is painfully slow. This is due to the closure algorithm being executed lots of times during the lexing. This could be avoided.
Create some tests that shows the poor performance of the runtime
Cache closure results for given state sets.
Probably improve the character range searches as well
It may be more familiar to people already using BNF files parseable by tools such as yacc and bison to be able to use this syntax as Piglet input. Investigate if this is possible.
It might also be a good idea to use Piglet itself to parse the input .y strings, since it will no doubt expose any pre-existing problems in Piglet if it hasn't found any use before implementing this feature.
Piglet could use an icon for website use and NuGet use. Preferably of a piglet, using some sort of vector graphics so we can scale up and down as necessary.
Maybe the piglet could be reading? Wearing glasses?
I feel this is of importance; having a graphical look would improve the professionalism of the library a lot, especially in NuGet package listings, where having an icon is a substantial improvement over the default one.
Tokens in the grammar should be able to set their associativity in a similar fashion to what bison allows. When defining a nonterminal, there should be three functions available:
var nonTerminal = configurator.NonTerminal();
// Configure productions
nonTerminal.LeftAssociative();
nonTerminal.RightAssociative();
nonTerminal.NonAssociative();
Somewhat counter-intuitively, when making a parser the expressions declared using regular expressions have a lower priority than string constants declared as part of the rules. This almost always bites you in the ass when you make a rule and is the source of some confusion.
You almost always want the constants to take precedence over expressions. It would simplify things greatly.
This will be a somewhat breaking change, but for most people it will not be a great one, since if the keywords are already separated the result will stay the same.
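The desired priority can be illustrated with a toy tokenizer; a Python sketch in which the name next_token and the keyword list are assumptions, not Piglet API. Literal keywords are tried before the general identifier expression, so "var" lexes as a keyword while "variable" stays an identifier:

```python
import re

KEYWORDS = ("var", "engineoff")
IDENT = re.compile(r"[a-zA-Z]+")

def next_token(text):
    # Literal keywords win over the more general identifier
    # expression, but only when no further letters follow, so
    # identifiers that merely start with a keyword are unaffected.
    for kw in KEYWORDS:
        if text.startswith(kw) and not IDENT.match(text[len(kw):]):
            return ("keyword", kw)
    m = IDENT.match(text)
    if m:
        return ("identifier", m.group(0))
    raise ValueError("no token at: %r" % text[:10])
```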
When using exceptionally large character range sets, \w in particular, constructing the parser can be quite slow. See if there are any performance improvements to be made here.
The main problem lies in the excessive CharRange comparisons that occur when distinguishing the ranges while minimizing the DFA and constructing the DFA from the NFA.
It's not a runtime problem; only the construction is excessively expensive.
Similarly to the tech configuration, it should be possible to handle errors with the fluent parser.
Something like having an Error function chained in between that takes a func somewhere. We can probably map the exception to an "Exception" member in the ExpandoObject, which means we can use the same WhenFound expression. It will only need an error token somewhere.
It would be good to have an option to use a contextual object when parsing. In this case, you would execute the Parse function together with a context object of a chosen type.
At the moment, if you need access to something while doing the parsing, you'd need to capture it in a closure when creating the parser, which encourages the bad practice of creating the parser anew every time you need to use it.
This could be applicable to symbol tables, to determine what parse tree to output, among other things. An overload on Configure that accepts another type and returns a new parser configurator whose reduction functions take two parameters would be nice. This should be fully backwards compatible; the current functions should work as is.
The fluent configuration for expressions should be changed. It's very clunky and turned out generally bad. I intend to do this once Snout is actually operational, so that I can leverage the functionality there and avoid doing the pretty cumbersome work manually.
It should be possible to set token associativity using the fluent configuration.
It's currently impossible to make the lexer ignore a C-style comment in a nice way. What is needed is a solution that allows the user to bypass the normal lexing rules, taking control of the input directly when a given token is found. This should enable the user both to return a token and to simply eat the incoming string straight up.
Something like this in flex:
"/*" {
    for (;;) {
        while ((c = input()) != '*' && c != EOF)
            ;  /* eat up text of comment */
        if (c == '*') {
            while ((c = input()) == '*')
                ;
            if (c == '/')
                break;  /* found the end */
        }
        if (c == EOF) {
            error("EOF in comment");
            break;
        }
    }
}
Perhaps only for ignored things, but it could also enable an ambitious user to get some sort of context-sensitive token recognition as well by lexer hacking.
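The same comment-eating hook can be sketched outside of flex; a Python illustration where the name eat_comment and its (text, pos) calling convention are assumptions about what such a user callback might look like:

```python
def eat_comment(text, pos):
    # Invoked when "/*" has been recognised at `pos`; consume through
    # the closing "*/" and return the position after it, raising on
    # EOF inside the comment just like the flex example above.
    end = text.find("*/", pos + 2)
    if end < 0:
        raise ValueError("EOF in comment")
    return end + 2

src = "a /* skip * this */ b"
rest = src[eat_comment(src, src.index("/*")):]
```

The lexer would resume normal tokenization at the returned position, so the comment never produces a token.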
Implement context dependent precedence, similar to what is found in bison.
So far the NuGet packages have been made totally manually with the Package Explorer. This feels sort of wrong, and should probably be replaced with a script file that builds the complete set.
A very common case is that you want to reduce to one of the input arguments, very often the first.
To the IProduction interface add the following methods:
ReduceToFirst() -> SetReduceFunction(x => x[0])
ReduceTo(int n) -> SetReduceFunction(x => x[n])
This should make the entire non-fluent configuration much more readable. Similar convenience functions already exist for the fluent interface, where single-part rules are automatically reduced to their single component's return value.
Make the fluent configuration support a context dependent precedence in much the same manner as the other configuration option.
Something like .UsingPrecedenceOf(symbol or string) would do the trick.
I am writing a simple parser for number sequences and written English numbers, e.g. "one-hundred-and-twenty-five" --> 125.
The grammar for number sequences should look as follows:
digit := "zero"
| ...
| "nine"
sequence := digit
| digit "-" sequence
My code looks as follows:
private static readonly Dictionary<decimal, string> TrivialDigitRegex = new Dictionary<decimal, string>
{
    [0] = "zero",
    [1] = "one",
    [2] = "two",
    [3] = "three",
    [4] = "four",
    [5] = "five",
    [6] = "six",
    [7] = "seven",
    [8] = "eight",
    [9] = "nine",
};

IParserConfigurator<decimal> conf = ParserFactory.Configure<decimal>();
INonTerminal<decimal> digit = conf.CreateNonTerminal();

foreach (var elem in from kvp in TrivialDigitRegex
                     select conf.CreateTerminal(kvp.Value, s => kvp.Key))
    digit.AddProduction(elem).SetReduceToFirst();

ITerminal<decimal> separator = conf.CreateTerminal("-");
INonTerminal<decimal> sequence = conf.CreateNonTerminal();
sequence.AddProduction(digit, separator, sequence).SetReduceFunction(a => a[0] * 10 + a[2]);
sequence.AddProduction(digit).SetReduceToFirst();

var parser = conf.CreateParser();
var result = parser.Parse("three-one-five");
It fails, however, on the last line with the following exception:
Piglet.Parser.ParseException occurred
  HResult=0x80131500
  Message=Illegal token -. Expected %EOF%
  Source=Piglet
  StackTrace:
   at Piglet.Parser.LRParser`1.Parse(ILexerInstance`1 lexerInstance) in ....../LRParser.cs:line 90
   at Piglet.Parser.LRParser`1.Parse(String input) in ....../LRParser.cs:line 158
It shall be noted that the execution of the following line works flawlessly:
var result = parser.Parse("three"); // returns 3
Is it a possible parsing bug, or am I missing something important? As far as I can see my grammar definition should be correct (leaving aside the fact that the grammar's entry point should be sequence and not digit, but I have not been able to find out how to set the parser's start symbol without reflection).
I am building a compiler in which semantic analysis will be done in a separate pass, after the AST has been built. But in order to report line numbers for semantic errors, I need to associate AST nodes with their corresponding token positions.
The token position is currently only provided (to SetErrorFunction) if a parse error occurs. But perhaps there could be another overload of SetReduceFunction, where the callback is given the lexer state as well as the list of objects to reduce?
Snout's primary purpose is to generate the fluent configuration for Piglet. We should use this functionality to avoid the nasty work of configuring this ourselves.
Piglet should come with a few very clearly written demos for parsing common tasks. Ideally this should be something that is pretty close to what people will actually use Piglet for in the end, while still remaining non-convoluted enough to get the point across. Ideas for this are:
Lexer demos are simpler by nature.
Perhaps this could be a separate project in the solution? The demos could then also serve as integration tests of a sort.
The same sort of error reporting that can be found in the tech configuration option should be available in the fluent configuration.
In certain cases the user might not want a tabular lexer but instead run with the in-memory state machine as it stands. This functionality got implemented with the improved debug output.
When constructing a lexer give the user the option to run either the:
Maybe an appropriate way would be to use a Func that returns a boolean if the error has been handled?
The lexer should work in tabular form with unicode. This will give a slight performance hit, but not too bad.
Piglet's regular expression engine cannot handle lookaheads and lookarounds. This is a feature that we would want to have in the end.
It should be possible to use Piglet's tech interface to actually generate the fluent interface, instead of the hard-coded silly way that it's being done now. This should make things pretty damn maintainable in the long run.
This is being worked on in the FluentCompile branch.
No minimization algorithm is currently run on the DFA. This should be done for performance and memory reasons.
Piglet's regular expression engine cannot currently handle & correctly, since it has a dual use as a concatenation operator in the postfix variant of the regular expression. This is a pretty silly restriction, and the postfix converter should probably be reworked to remove it.
In all honesty, the postfix converter could avoid building the postfix string altogether and simply call the appropriate function in the NFA to build the state machine.
There's a piece of code in the tabular lexer which searches the list of ranges in a linear fashion to find the correct range to put a char in. This is stupid, since the ranges are already sorted and we should be able to do a binary search instead.
This also goes for the NFA and DFA lexers, and should improve the runtime speed of the lexing and thus the entire parser.
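A Python sketch of the binary search, assuming the ranges are sorted and adjacent so each range is identified by its first character (the name find_range is illustrative):

```python
from bisect import bisect_right

def find_range(range_starts, ch):
    # range_starts holds the first character of each sorted, adjacent
    # range; a binary search replaces the linear scan, turning an
    # O(n) lookup per character into O(log n).
    return bisect_right(range_starts, ch) - 1

starts = ["a", "n"]  # two ranges: [a-m] and [n-z]
```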
There is no reason why Piglet shouldn't be able to generate code just as the other big tools do. I would suggest generating just the compressed tables, and providing ParserFactory with another loader method to load tables from a separate class.
The main issue is how to provide a nice access to the lambdas that are used for OnReduce and OnParse actions. An option would be to generate this using a similar syntax as is used in bison ($$, $1 and so on)
When using things such as backreferences, lookahead, lookbehind and look-in-some-place-Piglet-does-not-support, the error message should be appropriate to the situation. A specialized UnsupportedXXX exception should be thrown to highlight to the user that the expression IS valid, but that Piglet does not or cannot support this feature.
The NFA closure algorithm is very slow right now, since it is completely unoptimized and looks through the closures again and again, even though they never change. Change this to precompute the closures for all nodes in the grammar ONCE and only refer to the precomputed closures when making the DFA.
I have not, and to the best of my knowledge no one has, tried to build and run Piglet on Mono.
There should be nothing stopping this from being done as far as I know, and it would expand the audience for Piglet further.
MSTest is causing problems with the automated build system. Change the test framework to NUnit instead to make things better.
When using a regular expression, the numbered repetition does not work as you intend it to. The separator inside {} should be , and not : as it stands currently.
Furthermore, I believe that the implementation does not follow the standard when using only a single value. I believe it specifies only a minimum number of repetitions when given a single digit, whereas most implementations treat a single value as an exact count. Investigate and correct.
If there is a difference, use Microsoft's implementation, since that is what users coming from a .NET background will expect.
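The conventional semantics can be checked against Python's engine, which on these points follows the same convention as Microsoft's .NET engine:

```python
import re

# {m,n} uses a comma, and a lone {n} means EXACTLY n repetitions --
# neither a minimum nor a maximum.
exactly_three = re.compile(r"a{3}")
at_least_two = re.compile(r"a{2,}")
```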
In code:
static int Exp(int a, int x)
{
    return Enumerable.Range(0, x).Aggregate(1, (acc, i) => acc * a);
}

var configurator = ParserFactory.Configure<int>();
ITerminal<int> number = configurator.CreateTerminal("\\d+", t => int.Parse(t, System.Globalization.CultureInfo.InvariantCulture));
INonTerminal<int> expr = configurator.CreateNonTerminal();
INonTerminal<int> term = configurator.CreateNonTerminal();
INonTerminal<int> factor = configurator.CreateNonTerminal();
INonTerminal<int> expexpr = configurator.CreateNonTerminal();

expr.AddProduction(expr, "+", term).SetReduceFunction(s => s[0] + s[2]);
expr.AddProduction(expr, "-", term).SetReduceFunction(s => s[0] - s[2]);
expr.AddProduction(term).SetReduceFunction(s => s[0]);
term.AddProduction(term, "*", expexpr).SetReduceFunction(s => s[0] * s[2]);
term.AddProduction(term, "/", expexpr).SetReduceFunction(s => s[0] / s[2]);
term.AddProduction(expexpr).SetReduceFunction(s => s[0]);
expexpr.AddProduction(expexpr, "^", factor).SetReduceFunction(s => Exp(s[0], s[2]));
// expexpr.AddProduction(expexpr, "\\^", factor).SetReduceFunction(s => Exp(s[0], s[2]));
expexpr.AddProduction(factor).SetReduceFunction(s => s[0]);
factor.AddProduction(number).SetReduceFunction(s => s[0]);
factor.AddProduction("(", expr, ")").SetReduceFunction(s => s[1]);

var parser = configurator.CreateParser();
var value = parser.Parse("3^4 + 1");
I obtain this error:
Piglet.Lexer.Construction.LexerConstructionException: 'Unknown escaped character '^''
for both ^ and \\^.
Is there a mistake in my code?
Parsers and lexers are reentrant but currently not thread safe. This should be fixed for the next release. It's pretty important too; I will fix this very soon.
This is likely to be a breaking fix for directly used lexers, but the parser should be fine, since it already contains just one Parse() method.
Due to stupidity I have left debug console output in the LRParser which should not be there. This should be removed.
The lexer should include unicode support. I'm unsure how this might best be accomplished; I guess one solution could be to handle unicode characters as character combinations, i.e. two normal characters in a row.
The ? is not handled correctly when escaped.
When creating a lexer, you cannot run more than one source through it at a given time. Since the source is set on it, the lexer is a stateful object that you call Next on. The only way to get another lexer to run at the same time is to construct a full lexer again, even though this is totally unnecessary.
Separate the runtime info from the lexer into a separate class. When setting the input on the lexer, return a runtime bound to the current input. Many of these runtime instances should then be creatable and runnable at the same time.
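The proposed split can be sketched as two classes; a Python illustration where the names Lexer, LexerRuntime, begin, and next are assumptions about the shape of the API, not Piglet's actual types:

```python
import re

class Lexer:
    # Immutable definition: compiled once, shared freely.
    def __init__(self, pattern):
        self._regex = re.compile(pattern)

    def begin(self, text):
        # Bind one input and return an independent runtime; several
        # runtimes over the same lexer can be live simultaneously.
        return LexerRuntime(self._regex, text)

class LexerRuntime:
    # All mutable state (input, position) lives here, not in Lexer.
    def __init__(self, regex, text):
        self._regex, self._text, self._pos = regex, text, 0

    def next(self):
        while self._pos < len(self._text) and self._text[self._pos].isspace():
            self._pos += 1
        m = self._regex.match(self._text, self._pos)
        if not m:
            return None
        self._pos = m.end()
        return m.group(0)

lexer = Lexer(r"\w+")
a, b = lexer.begin("one two"), lexer.begin("x y")
```

The two runtimes can be advanced in any interleaving without disturbing each other.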
There should be an option to make a single rule part optional in the fluent configuration. Currently you have to make an empty nothingness rule and make that a production of whatever you want to be optional. This is not very intuitive by any means.
We already have something similar for declaring lists. It would be totally cool if this could be made to work on single entities within a rule as well.
The lexer does not perform dead state removal. This should be done in order to keep the number of states at a minimum and conserve memory and processing capability.
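Dead state removal amounts to keeping only states from which an accepting state is reachable; a Python sketch under the assumption that the automaton is a plain transition dict (remove_dead_states is an illustrative name):

```python
def remove_dead_states(transitions, accepting):
    # transitions: state -> {symbol: next_state}. Keep only states
    # from which some accepting state is reachable, by walking the
    # reversed edges backwards from the accepting states.
    reverse = {}
    for s, edges in transitions.items():
        for t in edges.values():
            reverse.setdefault(t, set()).add(s)
    alive, stack = set(accepting), list(accepting)
    while stack:
        for p in reverse.get(stack.pop(), ()):
            if p not in alive:
                alive.add(p)
                stack.append(p)
    return {s: {sym: t for sym, t in e.items() if t in alive}
            for s, e in transitions.items() if s in alive}

# State 2 is a dead trap: reachable, but with no path to accept state 1.
dfa = remove_dead_states({0: {"a": 1, "b": 2}, 1: {}, 2: {"a": 2}}, {1})
```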