dervall / piglet
An easier to use parsing and lexing tool that is configurable in fluent code without a pre-build step.
Home Page: binarysculpting.com
License: MIT License
Give the user an option to return default(T) instead of throwing exceptions when parsing fails. Suitable for applications like syntax highlighting, where you want the parser to be fast and to ignore bad input.
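The idea can be sketched as a thin wrapper; a minimal Python illustration, where the name parse_or_default and the choice of ValueError as the failure signal are assumptions for the sketch, not Piglet API:

```python
def parse_or_default(parser, text, default=None):
    # Return `default` instead of raising when parsing fails --
    # handy for syntax highlighting, where bad input is ignored.
    try:
        return parser(text)
    except ValueError:
        return default

ok = parse_or_default(int, "42")
bad = parse_or_default(int, "not a number")
```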
While Piglet will NOT support capture groups (simply because there is no reason to in a lexer), it would be a good idea not to choke on the very common (?: syntax for making a non-capturing group. In essence, we want to ignore this syntax, since the behaviour it gives is exactly the same as our default behaviour.
There might also be more of these syntax tricks that we can make the regex engine "compatible" with so that it becomes easier to copy-paste expressions from the web.
It would cover a nice niche use case if we could have a function that takes a number of regular expressions and returns true for each one that matches (or calls an appropriate callback). This is a case where you would normally use multiple regular expressions and test each of them in turn, and it is a subset of normal lexer operation.
In essence:
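A language-agnostic sketch of such a function in Python; the name match_any and the (pattern, callback) handler shape are assumptions for illustration:

```python
import re

def match_any(text, handlers):
    # Try each (pattern, callback) pair against the whole text and
    # invoke the callback for every pattern that matches.
    matched = []
    for pattern, callback in handlers:
        if re.fullmatch(pattern, text):
            callback(text)
            matched.append(pattern)
    return matched

hits = []
match_any("123", [
    (r"\d+", lambda s: hits.append("integer")),
    (r"[a-z]+", lambda s: hits.append("word")),
    (r"\d+(\.\d+)?", lambda s: hits.append("number")),
])
```

Unlike normal lexing, which picks a single winning token, every matching expression fires.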
In an effort to learn Piglet (which is excellent, by the way - nice work!), I'm trying to construct a parser based on the Turtle sample. I've come to a point where I want to start declaring and parsing variables, but if I try and work with anything that isn't a quoted string, it fails.
I've included the source that I'm working with below.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using Piglet.Parser;

namespace Piglet2
{
    class Car
    {
        public string Name { get; set; }

        public void Move(double distance)
        {
            Console.WriteLine("Moving {0}", distance);
        }

        public void Rotate(double angle)
        {
            Console.WriteLine("Rotating {0}", angle);
        }

        public void EngineOff()
        {
            Console.WriteLine("Turning engine off");
        }
    }

    class Program
    {
        static void Main(string[] args)
        {
            var variables = new Dictionary<string, object>();
            var input = @"
car NewCar
var somevar = ""My car""
var x = 10
rotate 20
move 50.2
rotate 20.7
engineoff";
            var car = new Car();

            var config = ParserFactory.Fluent();
            var program = config.Rule();
            var nameDecl = config.Rule();
            var statements = config.Rule();
            var statement = config.Rule();
            var variableDecl = config.Rule();
            var value = config.Rule();

            var symbol = config.Expression();
            symbol.ThatMatches(@"([a-zA-Z]+)").AndReturns(f => f);

            program.IsMadeUp.By(nameDecl).Followed.By(statements);
            statements.IsMadeUp.ByListOf(statement);

            statement.IsMadeUp
                .By(variableDecl)
                .Or.By("engineoff").WhenFound(f =>
                {
                    car.EngineOff();
                    return null;
                })
                .Or.By("move").Followed.By<double>().As("Distance").WhenFound(f =>
                {
                    car.Move(f.Distance);
                    return null;
                })
                .Or.By("rotate").Followed.By<double>().As("Distance").WhenFound(f =>
                {
                    car.Rotate(f.Distance);
                    return null;
                });

            variableDecl.IsMadeUp.By("var")
                .Followed.By(symbol).As("Name")
                .Followed.By("=")
                .Followed.By(value).As("Value")
                .WhenFound(f =>
                {
                    variables[f.Name] = f.Value;
                    return null;
                });

            nameDecl.IsMadeUp.By(@"car")
                .Followed.By(symbol).As("ProgramName")
                .WhenFound(f => car.Name = f.ProgramName);

            value.IsMadeUp
                .By(config.QuotedString)
                .Or.By<int>();

            var parser = config.CreateParser();

            try
            {
                parser.Parse(input);
            }
            catch (ParseException ex)
            {
                throw;
            }

            Console.WriteLine("Parsed program {0}", car.Name);
            Console.WriteLine("Variables found:");
            foreach (var item in variables)
            {
                Console.WriteLine("{0} = {1}", item.Key, item.Value);
            }
            Console.ReadKey();
        }
    }
}
As you can see, I've specified that a 'value' is made up by a quoted string or an integer. The crux of the issue is that if I include the line "var x = 10" in my input, the parser fails with this exception:
Illegal token \d+(\.\d+)?. Expected "(\\.|[^"])*",\d+
If I take it out, the input gets parsed successfully. I've tried this with booleans (both using .By() and also by specifying a custom boolean expression) and that fails too.
Thanks for taking a look - just wondering if I'm missing something obvious!
Since the project has an icon now, it should be in the NuGet package file. How one does that, I really don't have a clue.
Piglet builds with a lot of warnings since there are no XML doc comments on every public member. This is not acceptable and must be fixed!
\w has unexpected behaviour in that it only matches the English non-accented letters. The standard MS implementation, which is what people expect, will match all letters. This should be changed.
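The difference is easy to demonstrate against another Unicode-aware engine; here a Python sketch, where the ASCII-only flag stands in for Piglet's current behaviour:

```python
import re

# Unicode-aware \w (the behaviour .NET users expect) accepts
# accented letters; an ASCII-only \w (Piglet's current behaviour)
# rejects them.
unicode_match = re.fullmatch(r"\w+", "naïve")
ascii_match = re.fullmatch(r"\w+", "naïve", re.ASCII)
```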
NFA lexer construction is incredibly fast, but the execution is painfully slow. This is due to the closure algorithm being executed lots of times during the lexing. This could be avoided.
Create some tests that shows the poor performance of the runtime
Cache closure results for given state sets.
Probably improve the character range searches as well
It may be more familiar to people already using BNF files parseable by tools such as yacc and bison to be able to use this syntax as Piglet input. Investigate if this is possible.
It might also be a good idea to use Piglet itself to parse the input .y strings, since it will no doubt expose any pre-existing problems in Piglet if it hasn't found any use before implementing this feature.
Piglet could use an icon for website use and NuGet use. Preferably of a piglet, using some sort of vector graphics so we can scale up and down as necessary.
Maybe the piglet could be reading? Wearing glasses?
I feel this is of importance; having a graphical look would improve the professionalism of the library a lot, especially in NuGet package listings, where having an icon is a substantial improvement over the default one.
Tokens in the grammar should be able to set their associativity in a similar fashion to what bison allows. When defining a nonterminal, there should be three functions available:
var nonTerminal = configurator.NonTerminal();
// Configure productions
nonTerminal.LeftAssociative();
nonTerminal.RightAssociative();
nonTerminal.NonAssociative();
Somewhat counter-intuitively, when making a parser the expressions declared using regular expressions have a lower priority than string constants declared as part of the rules. This almost always bites you in the ass when you make a rule and is the source of some confusion.
You almost always want the constants to take precedence over expressions. It would simplify things greatly.
This will be a somewhat breaking change, but for most people it will not be a great one, since if the keywords are already separated the result will stay the same.
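The desired priority can be illustrated with a toy tokenizer; a Python sketch in which the name next_token and the keyword list are assumptions, not Piglet API. Literal keywords are tried before the general identifier expression, so "var" lexes as a keyword while "variable" stays an identifier:

```python
import re

KEYWORDS = ("var", "engineoff")
IDENT = re.compile(r"[a-zA-Z]+")

def next_token(text):
    # Literal keywords win over the more general identifier
    # expression, but only when no further letters follow, so
    # identifiers that merely start with a keyword are unaffected.
    for kw in KEYWORDS:
        if text.startswith(kw) and not IDENT.match(text[len(kw):]):
            return ("keyword", kw)
    m = IDENT.match(text)
    if m:
        return ("identifier", m.group(0))
    raise ValueError("no token at: %r" % text[:10])
```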
When using exceptionally large character range sets, \w in particular, constructing the parser can be quite slow. See if there are any performance improvements to be made here.
The main problem lies in the excessive CharRange comparisons that occur when distinguishing the ranges while minimizing the DFA and constructing the DFA from the NFA.
It's not a runtime problem; only the construction is excessively expensive.
Similarly to the tech configuration, it should be possible to handle errors with the fluent parser.
Something like having an Error function chained in between that takes a func somewhere. We can probably map the exception to an "Exception" member in the ExpandoObject, which means we can use the same WhenFound expression. It will only need an error token somewhere.
It would be good to have an option to use a contextual object when parsing. In this case, you would execute the Parse function together with a context object of a chosen type.
At the moment, if you need access to something while doing the parsing, you'd need to capture it in a closure when creating the parser, which encourages the bad practice of creating the parser anew every time you need to use it.
This could be applicable to symbol tables, to determine what parse tree to output, among other things. An overload on Configure that accepts another type and returns a new parser configurator whose reduction functions take two parameters would be nice. This should be fully backwards compatible; the current functions should work as is.
The fluent configuration for expressions should be changed. It's very clunky and turned out generally bad. I intend to do this once Snout is actually operational, so that I can leverage the functionality there and avoid doing the pretty cumbersome work manually.
It should be possible to set token associativity using the fluent configuration.
It's currently impossible to make the lexer ignore a C-style comment in a nice way. What is needed is a solution that allows the user to bypass the normal lexing rules, taking control of the input directly when a given token is found. This should enable the user both to return a token and to simply eat the incoming string straight up.
Something like this in flex:
"/*" {
    for (;;) {
        while ((c = input()) != '*' && c != EOF)
            ;  /* eat up text of comment */
        if (c == '*') {
            while ((c = input()) == '*')
                ;
            if (c == '/')
                break;  /* found the end */
        }
        if (c == EOF) {
            error("EOF in comment");
            break;
        }
    }
}
Perhaps only for ignored things, but it could also enable an ambitious user to get some sort of context-sensitive token recognition as well by lexer hacking.
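The same comment-eating hook can be sketched outside of flex; a Python illustration where the name eat_comment and its (text, pos) calling convention are assumptions about what such a user callback might look like:

```python
def eat_comment(text, pos):
    # Invoked when "/*" has been recognised at `pos`; consume through
    # the closing "*/" and return the position after it, raising on
    # EOF inside the comment just like the flex example above.
    end = text.find("*/", pos + 2)
    if end < 0:
        raise ValueError("EOF in comment")
    return end + 2

src = "a /* skip * this */ b"
rest = src[eat_comment(src, src.index("/*")):]
```

The lexer would resume normal tokenization at the returned position, so the comment never produces a token.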
Implement context dependent precedence, similar to what is found in bison.
So far the NuGet packages have been made totally manually with the Package Explorer. This feels sort of wrong, and should probably be replaced with a script file that builds the complete set.
A very common case is that you want to reduce to one of the input arguments, very often the first.
To the IProduction interface add the following methods:
ReduceToFirst() -> SetReduceFunction(x => x[0])
ReduceTo(int n) -> SetReduceFunction(x => x[n])
This should make the entire non-fluent configuration much more readable. Similar convenience functions already exist for the fluent interface, where single-part rules are automatically reduced to their single component's return value.
Make the fluent configuration support a context dependent precedence in much the same manner as the other configuration option.
Something like .UsingPrecedenceOf(symbol or string) would do the trick.
I am writing a simple parser for number sequences and written English numbers, e.g. "one-hundred-and-twenty-five" --> 125.
The grammar for number sequences should look as follows:
digit := "zero"
| ...
| "nine"
sequence := digit
| digit "-" sequence
My code looks as follows:
private static readonly Dictionary<decimal, string> TrivialDigitRegex = new Dictionary<decimal, string>
{
    [0] = "zero",
    [1] = "one",
    [2] = "two",
    [3] = "three",
    [4] = "four",
    [5] = "five",
    [6] = "six",
    [7] = "seven",
    [8] = "eight",
    [9] = "nine",
};

IParserConfigurator<decimal> conf = ParserFactory.Configure<decimal>();
INonTerminal<decimal> digit = conf.CreateNonTerminal();

foreach (var elem in from kvp in TrivialDigitRegex
                     select conf.CreateTerminal(kvp.Value, s => kvp.Key))
    digit.AddProduction(elem).SetReduceToFirst();

ITerminal<decimal> separator = conf.CreateTerminal("-");
INonTerminal<decimal> sequence = conf.CreateNonTerminal();
sequence.AddProduction(digit, separator, sequence).SetReduceFunction(a => a[0] * 10 + a[2]);
sequence.AddProduction(digit).SetReduceToFirst();

var parser = conf.CreateParser();
var result = parser.Parse("three-one-five");
It fails, however, on the last line with the following exception:
Piglet.Parser.ParseException occurred
  HResult=0x80131500
  Message=Illegal token -. Expected %EOF%
  Source=Piglet
  StackTrace:
   at Piglet.Parser.LRParser`1.Parse(ILexerInstance`1 lexerInstance) in ....../LRParser.cs:line 90
   at Piglet.Parser.LRParser`1.Parse(String input) in ....../LRParser.cs:line 158
It shall be noted that the execution of the following line works flawlessly:
var result = parser.Parse("three"); // returns 3
Is it a possible parsing bug, or am I missing something important? As far as I can see my grammar definition should be correct (leaving aside the fact that the grammar's entry point should be sequence and not digit, but I have not been able to find out how to set the parser's start symbol without reflection).
I am building a compiler in which semantic analysis will be done in a separate pass, after the AST has been built. But in order to report line numbers for semantic errors, I need to associate AST nodes with their corresponding token positions.
The token position is currently only provided (to SetErrorFunction) if a parse error occurs. But perhaps there could be another overload of SetReduceFunction, where the callback is given the lexer state as well as the list of objects to reduce?
Snout's primary purpose is to generate the fluent configuration for Piglet. We should use this functionality to avoid the nasty work of configuring this ourselves.
Piglet should come with a few very clearly written demos for parsing common tasks. Ideally this should be something that is pretty close to what people will actually use Piglet for in the end, while still remaining non-convoluted enough to get the point across. Ideas for this are:
Lexer demos are simpler by nature.
Perhaps this could be a separate project in the solution? The demos could then also serve as integration tests of a sort.
The same sort of error reporting that can be found in the tech configuration option should be available in the fluent configuration.
In certain cases the user might not want a tabular lexer but instead run with the in-memory state machine as it stands. This functionality got implemented with the improved debug output.
When constructing a lexer give the user the option to run either the:
Maybe an appropriate way would be to use a Func that returns a boolean if the error has been handled?
The lexer should work in tabular form with unicode. This will give a slight performance hit, but not too bad.
Piglet's regular expression engine cannot handle lookaheads and lookarounds. This is a feature that we would want to have in the end.
It should be possible to use Piglet's tech interface to actually generate the fluent interface, instead of the hard-coded silly way that it's being done now. This should make things pretty damn maintainable in the long run.
This is being worked on in the FluentCompile branch.
No minimization algorithm is currently run on the DFA. This should be done for performance and memory reasons.
Piglet's regular expression engine cannot currently handle & correctly, since it has a dual use as a concatenation operator in the postfix variant of the regular expression. This is a pretty silly restriction, and the postfix converter should probably be reworked to remove it.
In all honesty, the postfix converter could avoid building the postfix string altogether and simply call the appropriate function in the NFA to build the state machine.
There's a piece of code in the tabular lexer which searches the list of ranges in a linear fashion to find the correct range to put a char in. This is stupid, since the ranges are already sorted and we should be able to do a binary search instead.
This also goes for the NFA and DFA lexers, and should improve the runtime speed of the lexing and thus the entire parser.
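A Python sketch of the binary search, assuming the ranges are sorted and adjacent so each range is identified by its first character (the name find_range is illustrative):

```python
from bisect import bisect_right

def find_range(range_starts, ch):
    # range_starts holds the first character of each sorted, adjacent
    # range; a binary search replaces the linear scan, turning an
    # O(n) lookup per character into O(log n).
    return bisect_right(range_starts, ch) - 1

starts = ["a", "n"]  # two ranges: [a-m] and [n-z]
```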
There is no reason why Piglet shouldn't be able to generate code just as the other big tools do. I would suggest generating just the compressed tables, and providing ParserFactory with another loader method to load tables from a separate class.
The main issue is how to provide a nice access to the lambdas that are used for OnReduce and OnParse actions. An option would be to generate this using a similar syntax as is used in bison ($$, $1 and so on)
When using things such as backreferences, lookahead, lookbehind and look-in-some-place-Piglet-does-not-support, the error message should be appropriate to the situation. A specialized UnsupportedXXX exception should be thrown to highlight to the user that the expression IS valid, but that Piglet does not or cannot support this feature.
The NFA closure algorithm is very slow right now, since it is completely unoptimized and looks through the closures again and again, even though they never change. Change this to precompute the closures for all nodes in the grammar ONCE and only refer to the precomputed closures when making the DFA.
I have not, and to the best of my knowledge no one has, tried to build and run Piglet on Mono.
There should be nothing stopping this from being done as far as I know, and it would expand the audience for Piglet further.
MSTest is causing problems with the automated build system. Change the test framework to NUnit instead to make things better.
When using a regular expression, the numbered repetition does not work as you intend it to. The separator inside {} should be , and not : as it stands currently.
Furthermore, I believe that the implementation does not follow the standard when using only a single value. I believe it specifies only a minimum number of repetitions when given a single digit, whereas most implementations treat a single value as an exact count. Investigate and correct.
If there is a difference, use Microsoft's implementation, since that is what users coming from a .NET background will expect.
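The conventional semantics can be checked against Python's engine, which on these points follows the same convention as Microsoft's .NET engine:

```python
import re

# {m,n} uses a comma, and a lone {n} means EXACTLY n repetitions --
# neither a minimum nor a maximum.
exactly_three = re.compile(r"a{3}")
at_least_two = re.compile(r"a{2,}")
```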
In code:
static int Exp(int a, int x)
{
    return Enumerable.Range(0, x).Aggregate(1, (acc, i) => acc * a);
}

var configurator = ParserFactory.Configure<int>();
ITerminal<int> number = configurator.CreateTerminal("\\d+", t => int.Parse(t, System.Globalization.CultureInfo.InvariantCulture));
INonTerminal<int> expr = configurator.CreateNonTerminal();
INonTerminal<int> term = configurator.CreateNonTerminal();
INonTerminal<int> factor = configurator.CreateNonTerminal();
INonTerminal<int> expexpr = configurator.CreateNonTerminal();

expr.AddProduction(expr, "+", term).SetReduceFunction(s => s[0] + s[2]);
expr.AddProduction(expr, "-", term).SetReduceFunction(s => s[0] - s[2]);
expr.AddProduction(term).SetReduceFunction(s => s[0]);
term.AddProduction(term, "*", expexpr).SetReduceFunction(s => s[0] * s[2]);
term.AddProduction(term, "/", expexpr).SetReduceFunction(s => s[0] / s[2]);
term.AddProduction(expexpr).SetReduceFunction(s => s[0]);
expexpr.AddProduction(expexpr, "^", factor).SetReduceFunction(s => Exp(s[0], s[2]));
// expexpr.AddProduction(expexpr, "\\^", factor).SetReduceFunction(s => Exp(s[0], s[2]));
expexpr.AddProduction(factor).SetReduceFunction(s => s[0]);
factor.AddProduction(number).SetReduceFunction(s => s[0]);
factor.AddProduction("(", expr, ")").SetReduceFunction(s => s[1]);

var parser = configurator.CreateParser();
var value = parser.Parse("3^4 + 1");
I obtain this error:
Piglet.Lexer.Construction.LexerConstructionException: 'Unknown escaped character '^''
for both ^ and \\^.
Is there a mistake in my code?
Parsers and lexers are reentrant but currently not thread safe. This should be fixed for the next release. It's pretty important too; I will fix this very soon.
This is likely to be a breaking fix for directly used lexers, but the parser should be fine, since it already contains just one Parse() method.
Due to stupidity I have left debug console output in the LRParser which should not be there. This should be removed.
The lexer should include unicode support. I'm unsure how this might best be accomplished; I guess one solution could be to handle unicode characters as character combinations, i.e. two normal characters in a row.
The ? is not handled correctly when escaped.
When creating a lexer, you cannot run more than one source through it at a given time. Since the source is set on it, the lexer is a stateful object that you call Next on. The only way to get another lexer to run at the same time is to construct a full lexer again, even though this is totally unnecessary.
Separate the runtime info from the lexer into a separate class. When setting the input on the lexer, return a runtime bound to the current input. Many of these runtime instances should then be creatable and runnable at the same time.
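The proposed split can be sketched as two classes; a Python illustration where the names Lexer, LexerRuntime, begin, and next are assumptions about the shape of the API, not Piglet's actual types:

```python
import re

class Lexer:
    # Immutable definition: compiled once, shared freely.
    def __init__(self, pattern):
        self._regex = re.compile(pattern)

    def begin(self, text):
        # Bind one input and return an independent runtime; several
        # runtimes over the same lexer can be live simultaneously.
        return LexerRuntime(self._regex, text)

class LexerRuntime:
    # All mutable state (input, position) lives here, not in Lexer.
    def __init__(self, regex, text):
        self._regex, self._text, self._pos = regex, text, 0

    def next(self):
        while self._pos < len(self._text) and self._text[self._pos].isspace():
            self._pos += 1
        m = self._regex.match(self._text, self._pos)
        if not m:
            return None
        self._pos = m.end()
        return m.group(0)

lexer = Lexer(r"\w+")
a, b = lexer.begin("one two"), lexer.begin("x y")
```

The two runtimes can be advanced in any interleaving without disturbing each other.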
There should be an option to make a single rule part optional in the fluent configuration. Currently you have to make an empty nothingness rule and make that a production of whatever you want to be optional. This is not very intuitive by any means.
We already have something similar for declaring lists. It would be totally cool if this could be made to work on single entities within a rule as well.
The lexer does not perform dead state removal. This should be done in order to keep the number of states at a minimum and conserve memory and processing capability.
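Dead state removal amounts to keeping only states from which an accepting state is reachable; a Python sketch under the assumption that the automaton is a plain transition dict (remove_dead_states is an illustrative name):

```python
def remove_dead_states(transitions, accepting):
    # transitions: state -> {symbol: next_state}. Keep only states
    # from which some accepting state is reachable, by walking the
    # reversed edges backwards from the accepting states.
    reverse = {}
    for s, edges in transitions.items():
        for t in edges.values():
            reverse.setdefault(t, set()).add(s)
    alive, stack = set(accepting), list(accepting)
    while stack:
        for p in reverse.get(stack.pop(), ()):
            if p not in alive:
                alive.add(p)
                stack.append(p)
    return {s: {sym: t for sym, t in e.items() if t in alive}
            for s, e in transitions.items() if s in alive}

# State 2 is a dead trap: reachable, but with no path to accept state 1.
dfa = remove_dead_states({0: {"a": 1, "b": 2}, 1: {}, 2: {"a": 2}}, {1})
```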