
lexgen's Introduction

lexgen: A fully-featured lexer generator, implemented as a proc macro

use lexgen::lexer;
use lexgen_util::Loc;

lexer! {
    // First line specifies name of the lexer and the token type returned by
    // semantic actions
    Lexer -> Token;

    // Regular expressions can be named with `let` syntax
    let init = ['a'-'z'];
    let subseq = $init | ['A'-'Z' '0'-'9' '-' '_'];

    // Rule sets have names. Each rule set is compiled to a separate DFA.
    // Switching between rule sets is done explicitly in semantic actions.
    rule Init {
        // Rules without a right-hand side for skipping whitespace,
        // comments, etc.
        [' ' '\t' '\n']+,

        // Rule for matching identifiers
        $init $subseq* => |lexer| {
            let token = Token::Id(lexer.match_().to_owned());
            lexer.return_(token)
        },
    }
}

// The token type
#[derive(Debug, PartialEq, Eq)]
enum Token {
    // An identifier
    Id(String),
}

// Generated lexers are initialized with a `&str` for the input
let mut lexer = Lexer::new(" abc123Q-t  z9_9");

// Lexers implement `Iterator<Item=Result<(Loc, T, Loc), LexerError>>`,
// where `T` is the token type specified in the lexer definition (`Token` in
// this case), and `Loc`s indicate line, column, and byte indices of
// beginning and end of the lexemes.
assert_eq!(
    lexer.next(),
    Some(Ok((
        Loc { line: 0, col: 1, byte_idx: 1 },
        Token::Id("abc123Q-t".to_owned()),
        Loc { line: 0, col: 10, byte_idx: 10 }
    )))
);
assert_eq!(
    lexer.next(),
    Some(Ok((
        Loc { line: 0, col: 12, byte_idx: 12 },
        Token::Id("z9_9".to_owned()),
        Loc { line: 0, col: 16, byte_idx: 16 }
    )))
);
assert_eq!(lexer.next(), None);

Motivation

Implementing lexing is often (along with parsing) the most tedious part of implementing a language. Lexer generators make this much easier, but in Rust existing lexer generators miss essential features for practical use, and/or require a pre-processing step when building.

My goal with lexgen is to have a feature-complete and easy to use lexer generator.

Usage

lexgen doesn't require a build step. Add the same version of lexgen and lexgen_util as dependencies in your Cargo.toml.
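
For example (a minimal sketch; the version numbers below are illustrative, check crates.io for the current release, and keep both crates at the same version):

[dependencies]
lexgen = "0.13"
lexgen_util = "0.13"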

Lexer syntax

lexgen lexers start with the name of the generated lexer struct, optional user state part, and the token type (type of values returned by semantic actions). Example:

lexer! {
    Lexer(LexerState) -> Token;
    ...
}

Here the generated lexer type will be named Lexer. The user state type is LexerState (this type should be defined by the user). The token type is Token.

After the lexer name, user state type, and token type, we define the rule sets:

rule Init {
    ...
}

rule SomeOtherRule {
    ...
}

The first rule set defines the initial state of the lexer and needs to be named Init.

In the body of a rule block we define the rules for that lexer state. The syntax for a rule is <regex> => <semantic action>,. Regex syntax is described below. A semantic action is any Rust code with the type fn(&mut Lexer) -> SemanticActionResult<Token>, where Lexer is the generated lexer struct (Lexer in our example) and Token is the token type. More on these types below.

Regular expressions can be named with let <name> = <regex>; syntax. Example:

let init = ['a'-'z'];
let subseq = $init | ['A'-'Z' '0'-'9' '-' '_'];

// Named regexes can be used with the `$` prefix
$init $subseq* => |lexer| { ... }

You can omit the rule Init { ... } part and have all of your rules at the top level if you don't need rule sets.

In summary:

  • The first line is of the form <lexer name>(<user state type>) -> <token type name>. The (<user state type>) part can be omitted for stateless lexers.

  • Next come the rule sets. There should be at least one rule set named Init, which is the initial state.

  • let bindings can be added at the top-level or in rules.

Regex syntax

Regex syntax can be used in right-hand side of let bindings and left-hand side of rules. The syntax is:

  • $var for variables defined in the let binding section. Variables need to be defined before used.

  • $$var for built-in regexes (see "Built-in regular expressions" section below).

  • Rust character syntax for characters, e.g. 'a'.

  • Rust string syntax for strings, e.g. "abc".

  • [...] for character sets. Inside the brackets you can have one or more of:

    • Characters
    • Character ranges: e.g. 'a'-'z'

    Here's an example character set for ASCII alphanumerics: ['a'-'z' 'A'-'Z' '0'-'9']

  • _ for matching any character

  • $ for matching end-of-input

  • <regex>* for zero or more repetitions of <regex>

  • <regex>+ for one or more repetitions of <regex>

  • <regex>? for zero or one repetitions of <regex>

  • <regex> <regex> for concatenation

  • <regex> | <regex> for alternation: match the first one, or the second one.

  • <regex> # <regex> for difference: match characters in the first regex that are not in the second regex. Note that the regexes on the left and right of # should be character sets, i.e. *, +, ?, "...", $, and concatenation are not allowed. Variables that are bound to character sets are allowed.

Binding powers (precedences), from higher to lower:

  • *, +, ?
  • #
  • Concatenation
  • |

You can use parentheses for grouping, e.g. ('a' | 'b')*.

Example: 'a' 'b' | 'c'+ is the same as (('a' 'b') | ('c'+)).
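
As a sketch of the difference operator in use (the binding name and Token variant are illustrative): match a string body made of characters other than a double quote or a backslash.

let str_char = _ # ['"' '\\'];

'"' $str_char* '"' => |lexer| {
    let s = lexer.match_().to_owned();
    lexer.return_(Token::Str(s))
},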

Right context (lookahead)

A rule in a rule set can be followed by another regex using the > <regex> syntax, for right context. Right context is basically a limited form of lookahead: it can only appear after a top-level regex for a rule, and it cannot be nested inside a regex.

For example, the rule left-hand side 'a' > (_ # 'b') matches 'a' as long as it's not followed by 'b'.

See also right context tests for more examples.

Built-in regular expressions

lexgen comes with a set of built-in regular expressions. Regular expressions listed below match the same set of characters as their Rust counterparts. For example, $$alphabetic matches the same set of characters as Rust's char::is_alphabetic:

  • $$alphabetic
  • $$alphanumeric
  • $$ascii
  • $$ascii_alphabetic
  • $$ascii_alphanumeric
  • $$ascii_control
  • $$ascii_digit
  • $$ascii_graphic
  • $$ascii_hexdigit
  • $$ascii_lowercase
  • $$ascii_punctuation
  • $$ascii_uppercase
  • $$ascii_whitespace
  • $$control
  • $$lowercase
  • $$numeric
  • $$uppercase
  • $$whitespace

(Note that in the generated code we don't use Rust char methods. For simple cases like $$ascii we generate simple range checks. For more complicated cases like $$lowercase we generate a binary search table and run a binary search when checking a character.)

In addition, these two built-in regular expressions match Unicode XID_Start and XID_Continue:

  • $$XID_Start
  • $$XID_Continue
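
For example, a Unicode-aware identifier regex can be built from these (a sketch; the binding name is illustrative):

let ident = $$XID_Start $$XID_Continue*;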

Rule syntax

  • <regex> => <semantic action>,: <regex> syntax is as described above. <semantic action> is any Rust code with type fn(&mut Lexer) -> SemanticActionResult<Token>. More on SemanticActionResult type in the next section.

  • <regex> =? <semantic action>,: fallible actions. This syntax is similar to the syntax above, except <semantic action> has type fn(&mut Lexer) -> SemanticActionResult<Result<Token, UserError>>. When using rules of this kind, the error type needs to be declared at the beginning of the lexer with the type Error = UserError; syntax.

    When a rule of this kind returns an error, the error is returned to the caller of the lexer's next method.

  • <regex>,: Syntactic sugar for <regex> => |lexer| { lexer.reset_match(); lexer.continue_() },. Useful for skipping characters (e.g. whitespace).

  • <regex> = <token>,: Syntactic sugar for <regex> => |lexer| lexer.return_(<token>),. Useful for matching keywords, punctuation (operators) and delimiters (parens, brackets).
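
Putting the rule forms above together, a short sketch (the Token variants are illustrative):

lexer! {
    Lexer -> Token;

    rule Init {
        // `<regex>,`: skip whitespace (resets the match and continues)
        [' ' '\t' '\n']+,

        // `<regex> = <token>,`: return a fixed token for keywords and punctuation
        "if" = Token::If,
        '(' = Token::LParen,

        // `<regex> => <semantic action>,`: run arbitrary Rust code
        ['0'-'9']+ => |lexer| {
            let n: u64 = lexer.match_().parse().unwrap();
            lexer.return_(Token::Int(n))
        },
    }
}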

End-of-input handling in rule sets

The Init rule set terminates lexing successfully on end-of-input (i.e. lexer.next() returns None). Other rule sets fail on end-of-input (i.e. return Some(Err(...))). This is because generally the states other than the initial one are for complicated tokens (strings, raw strings, multi-line comments) that need to be terminated and handled, and end-of-input in those states usually means the token did not terminate properly.

(To handle end-of-input in a rule set you can use $ as described in section "Regex syntax" above.)
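
For example, a sketch of a line-comment rule where either a newline or end-of-input terminates the comment (the Token variant is illustrative):

lexer! {
    Lexer -> Token;

    rule Init {
        "//" => |lexer| lexer.switch(LexerRule::Comment),
    }

    rule Comment {
        // A newline or end-of-input ends the comment.
        ('\n' | $) => |lexer| {
            let comment = lexer.match_().to_owned();
            lexer.switch_and_return(LexerRule::Init, Token::Comment(comment))
        },

        _,
    }
}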

Handle, rule, error, and action types

The lexer macro generates a struct with the name specified by the user in the first line of the lexer definition. In the example at the beginning (Lexer -> Token;), the name of the struct is Lexer.

A mut reference to this type is passed to semantic action functions. In the implementation of a semantic action, you should use one of the methods below to drive the lexer and return tokens:

  • fn match_(&self) -> &str: returns the current match. Note that when the lexer is constructed with new_from_iter or new_from_iter_with_state, this method panics. It should only be called when the lexer is initialized with new or new_with_state.
  • fn match_loc(&self) -> (lexgen_util::Loc, lexgen_util::Loc): returns the bounds of the current match
  • fn peek(&mut self) -> Option<char>: looks ahead one character
  • fn state(&mut self) -> &mut <user state type>: returns a mutable reference to the user state
  • fn return_(&self, token: <user token type>) -> SemanticActionResult: returns the passed token as a match.
  • fn continue_(&self) -> SemanticActionResult: ignores the current match and continues lexing in the same lexer state. Useful for skipping characters.
  • fn switch(&mut self, rule: LexerRule) -> SemanticActionResult: used for switching between lexer states. The LexerRule (where Lexer part is the name of the lexer as specified by the user) is an enum with a variant for each rule set name, for example, LexerRule::Init. See the stateful lexer example below.
  • fn switch_and_return(&mut self, rule: LexerRule, token: <user token type>) -> SemanticActionResult: switches to the given lexer state and returns the given token.
  • fn reset_match(&mut self): resets the current match. E.g. if you call match_() right after reset_match() it will return an empty string.

Semantic action functions should return a SemanticActionResult value obtained from one of the methods listed above.

Initializing lexers

lexgen generates 4 constructors:

  • fn new(input: &str) -> Self: Used when the lexer does not have user state, or user state implements Default.

  • fn new_with_state(input: &str, user_state: S) -> Self: Used when the lexer has user state that does not implement Default, or you want to initialize the state with something other than the default. S is the user state type specified in lexer definition. See stateful lexer example below.

  • fn new_from_iter<I: Iterator<Item = char> + Clone>(iter: I) -> Self: Used when the input isn't a flat string, but something like a rope or zipper. Note that the match_ method panics when this constructor is used. Instead use match_loc to get the location of the current match.

  • fn new_from_iter_with_state<I: Iterator<Item = char> + Clone, S>(iter: I, user_state: S) -> Self: Same as above, but doesn't require user state to implement Default.
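
A sketch of the iterator-based constructors, reusing the identifier lexer from the first example; since match_ would panic here, the Locs returned by the iterator (or match_loc) and their byte indices are used to slice the input ourselves:

let input = "abc xyz";
let mut lexer = Lexer::new_from_iter(input.chars());

while let Some(result) = lexer.next() {
    let (start, token, end) = result.unwrap();
    // Recover the matched text from the original input using byte indices.
    let text = &input[start.byte_idx..end.byte_idx];
    println!("{:?}: {:?}", token, text);
}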

Stateful lexer example

Here's an example lexer that counts the number of =s that appear between two [s:

lexer! {
    // `usize` in parenthesis is the user state type, `usize` after the arrow
    // is the token type
    Lexer(usize) -> usize;

    rule Init {
        $$ascii_whitespace,                             // line 7

        '[' => |lexer| {
            *lexer.state() = 0;                         // line 10
            lexer.switch(LexerRule::Count)              // line 11
        },
    }

    rule Count {
        '=' => |lexer| {
            *lexer.state() += 1;                        // line 17
            lexer.continue_()                           // line 18
        },

        '[' => |lexer| {
            let n = *lexer.state();
            lexer.switch_and_return(LexerRule::Init, n) // line 23
        },
    }
}

let mut lexer = Lexer::new("[[ [=[ [==[");
assert_eq!(
    lexer.next(),
    Some(Ok((
        Loc { line: 0, col: 0, byte_idx: 0 },
        0,
        Loc { line: 0, col: 2, byte_idx: 2 },
    )))
);
assert_eq!(
    lexer.next(),
    Some(Ok((
        Loc { line: 0, col: 3, byte_idx: 3 },
        1,
        Loc { line: 0, col: 6, byte_idx: 6 },
    )))
);
assert_eq!(
    lexer.next(),
    Some(Ok((
        Loc { line: 0, col: 7, byte_idx: 7 },
        2,
        Loc { line: 0, col: 11, byte_idx: 11 },
    )))
);
assert_eq!(lexer.next(), None);

Initially (the Init rule set) we skip spaces (line 7). When we see a [ we initialize the user state (line 10) and switch to the Count state (line 11). In Count, each = increments the user state by one (line 17) and skips the match (line 18). A [ in Count state returns the current number and switches to the Init state (line 23).

lexgen's Issues

`switch_and_return` does not consume the returned token?

First, I want to thank you for writing lexgen. I find it incredibly useful when paired with LALRPOP (and would at some point like to contribute some integration to it, though I'm not sure exactly what that should look like right now).

I have a lexer with two rules: Init and NoteBody. The purpose of NoteBody is to consume characters until a newline is found, and the switch between rules happens when an = (Tok::Equal) is detected and certain conditions (the in_note_name state) are met. The logic looks like this:

"=" => |mut lexer| {
    if lexer.state().in_note_name {
        lexer.switch_and_return(LexerRule::NoteBody, Tok::Equal)
    } else {
        lexer.return_(Tok::Equal)
    }
},

When I've detected a newline in the NoteBody rule by peek()ing, I want to return all the characters I've found after the equals sign and up to (not including) the newline. I would expect the logic for this to look something like:

let match_ = lexer.match_();
...
lexer.switch_and_return(LexerRule::Init, Tok::NoteBody(&match_))

However, it turns out that this will return the = token prepended to the text I actually want to match. In order to return just the text after the = token, I need to do:

let match_ = lexer.match_();
...
lexer.switch_and_return(LexerRule::Init, Tok::NoteBody(&match_[1..=match_end]))

My question is: Is this a bug, or intended behavior? If the latter, is my workaround to strip the first character from the match correct, or can you suggest a method for removing the = sign before the transition from the Init to NoteBody rule?

Consider inlining accepting states without transitions

I will come up with a minimal repro, but here's an example for now:

match self.state {
    ...
    56usize => match self.iter.peek().copied() {
        ...
        Some((char_idx, char)) => match char {
            '\u{5c}' => {
                self.current_match_end += char.len_utf8();
                let _ = self.iter.next();
                self.state = 58usize;
            }
            ...
        },
    },
    ...
    58usize => {
        let rhs: fn(LexerHandle<'_, 'input>) -> LexerAction<Token> =
            |lexer| lexer.switch(LexerRule::StringEscape);
        let str = &self.input[self.current_match_start..self.current_match_end];
        let handle = LexerHandle {
            iter: &mut self.iter,
            match_: str,
            user_state: &mut self.user_state,
        };
        match rhs(handle) {
            LexerAction::Continue => {
                self.state = self.initial_state;
                continue;
            }
            LexerAction::Return(tok) => {
                self.state = self.initial_state;
                return Some(Ok((
                    self.current_match_start,
                    tok,
                    self.current_match_end,
                )));
            }
            LexerAction::Switch(rule_set) => {
                self.switch(rule_set);
                continue;
            }
            LexerAction::SwitchAndReturn(tok, rule_set) => {
                self.switch(rule_set);
                return Some(Ok((
                    self.current_match_start,
                    tok,
                    self.current_match_end,
                )));
            }
        }
    }
    ...
}

State 56 has a transition to 58. State 58 is an accepting state and it doesn't have any transitions.

In general, when a state is an accepting state and doesn't have any transitions we can "inline" it and avoid (1) extra states (2) redundant looping (no need to update the state, continue).

In the example state 58 is only referred by 56. In general I'm guessing this doesn't have to be the case, but I suspect it may still be OK to inline accepting states.

The state type is required to implement Default even if client code never calls `new` or `new_from_iter`

The README claims the following:

lexgen/README.md

Lines 302 to 308 in 6be10c9

- `fn new(input: &str) -> Self`: Used when the lexer does not have user state,
or user state implements `Default`.
- `fn new_with_state(input: &str, user_state: S) -> Self`: Used when the lexer
has user state that does not implement `Default`, or you want to initialize
the state with something other than the default. `S` is the user state type
specified in lexer definition. See stateful lexer example below.

However, as the library currently stands, the state type is required to implement Default even if the new and new_from_iter functions are never used. The reason is that new and new_from_iter do not list $StateType: Default as a constraint, yet assume an implementation of Default in their bodies; therefore, if $StateType does not implement Default, compilation of the entire program fails, even if the client never uses those functions.

A partial solution, which I implemented in the process of making #53, is to add $StateType: Default as an explicit constraint. However, this only bypasses the problem if $StateType contains generic parameters. So, if $StateType does not implement Default, $StateType = Foo<'input> and $StateType = Foo<'a> are fine, but $StateType = Foo<'static> and $StateType = Foo are not. The fact that Rust rejects Foo<'static>: Default means that I had to use 'input: 'static and Foo<'input>, which solves the issue but creates warnings that the generic parameter 'input should be replaced with 'static. This is also only a partial solution because it does not allow a state type that does not implement Default if the state has no generic parameters.

(In my opinion, the rejection of a function with constraint Foo: Default, where Foo does not implement Default, is an illogical design choice on the part of Rust. If the constraint Foo: Trait is not satisfied, that is no different in principle than if Foo<T>: Trait is not satisfied for some choice of T. If Foo<T>: Trait is not satisfied for some choice of T, that simply means that the function is not usable for such T. If Foo: Trait is not satisfied where Foo has no generic parameters, this should simply mean that the function is not usable under any circumstances, not that the compiler should reject the function definition. This illogical design choice unfortunately makes my PR messy.)

Implement Default for Loc

Thank you for the amazing lexer!

The generated lexer seems to be 99% compatible with what lalrpop takes. The only incompatibility is the lack of Default implementation for lexgen_util::Loc, which lalrpop uses to initialize the location.
If there's no reason not to implement it, it'd be great to have it implemented so that the following just works (with type Location = lexgen_util::Loc and type Error = lexgen_util::LexerError<MyError> defined in the extern section of the .lalrpop file).

let lexer = Lexer::new(query); // The lexgen-generated lexer
let query = QueryParser::new() // The lalrpop-generated parser
    .parse(query, lexer)
    .unwrap();
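
For reference, a minimal sketch of the requested impl (it would have to live in lexgen_util itself, or be a #[derive(Default)] on Loc; the fields are the ones shown in the examples above):

impl Default for Loc {
    fn default() -> Self {
        Loc { line: 0, col: 0, byte_idx: 0 }
    }
}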

Regex compilation bug when prefix is shared by regexes

use lexgen::lexer;

fn ignore_pos<A, E>(ret: Option<Result<(usize, A, usize), E>>) -> Option<Result<A, E>> {
    ret.map(|res| res.map(|(_, a, _)| a))
}

fn next<A, E>(
    iter: &mut dyn Iterator<Item = Result<(usize, A, usize), E>>,
) -> Option<Result<A, E>> {
    ignore_pos(iter.next())
}

fn return_match<'lexer, 'input>(lexer: LexerHandle<'lexer, 'input>) -> LexerAction<&'input str> {
    let match_ = lexer.match_();
    lexer.return_(match_)
}

lexer! {
    Lexer -> &'input str;

    "xyzxyz" => return_match,
    "xyz" => return_match,
    "xya" => return_match,
}

fn main() {
    let mut lexer = Lexer::new("xyzxya");
    assert_eq!(next(&mut lexer), Some(Ok("xyz")));  // fails
    assert_eq!(next(&mut lexer), Some(Ok("xya")));
    assert_eq!(next(&mut lexer), None);
}

Interestingly, we have very similar examples that work fine. For example, Lua lexer has regexes for "elseif" and "else".

match until/til?

for example, how to lex rust like raw string: r###"..."###

rule Init {
  'r' '#'* '"' => |lexer| {
    *lexer.state() = lexer.match_().len() - 1; // set the number of end marks ("#...")
    lexer.switch_and_return(LexerRule::RawText, Token::RawTextStart)
  },
}

rule RawText {
  // here, I don't know how to constrain the number of '#'s
  _* '"' '#'* => |lexer| {
    let n = *lexer.state();
    let cont = lexer.match_();
    let body = &cont[..cont.len() - n];
    lexer.switch_and_return(LexerRule::Init, Token::RawTextCont(body, n))
  },
}

Implement location tracking and a way to get a match's location

Currently error messages only include the byte index of the error, and we don't provide a method for semantic actions to get the current location.

With error messages we should also (in addition to byte index) return the line and column number, and we should provide a method to get the current line and column number.

To allow attaching location information to tokens, we should also return the start and end locations of lexemes when calling semantic actions.

Confusing match reset behavior in initial state

(Issue originally discovered in #11)

self.current_match_start = char_idx;

The line above resets the current match every time we're back at the initial state (e.g. after switch(LexerRule::Init), or simply by looping in the initial state).

This behavior is a bit confusing, and it seems hacky too. I don't remember the original motivation for implementing this, but if I remove this special case and look at the broken tests, it seems like we rely on this to skip whitespace:

rule Init {
    $whitespace,
    ...
}

In a rule like this, if we reset the current match every time we reach the initial state, that means the skipped whitespace will never be a part of the current match. So if I have other cases in this rule where I use lexer.match_() I won't ever see the skipped characters in the match.

In addition to being perhaps a bit counter-intuitive, this is a special case in the initial rule, so if I have a different rule I can't easily have the same behavior (I need to count the amount of whitespace matched, and manually drop the prefix).

Reminder: other than the special case above, we only reset the match after returning a token.


One potential fix would be to never implicitly reset the match, and implement a method for resetting the current match.

For skipping whitespace concisely, we could either desugar the rule syntax without RHS ($whitespace, above) to "reset and continue", or keep it the way it is (just "continue") and implement skipping whitespace as:

rule Init {
    $whitespace => |mut lexer| {
        lexer.reset_match();
        lexer.continue_()
    },
    ...
}

Either way there will be some breakage. If we desugar it as "reset and continue" then uses in non-initial rules will break. If we desugar it as just "continue" then uses in the initial rule will break.

Consider merging char and range transitions in NFA and DFA

Currently we have these transitions in NFAs:

struct State<A> {
    char_transitions: Map<char, Set<StateIdx>>,
    range_transitions: RangeMap<Set<StateIdx>>,
    empty_transitions: Set<StateIdx>,
    any_transitions: Set<StateIdx>,
    end_of_input_transitions: Set<StateIdx>,
    ...
}

(I was confused for a few seconds here about why we have both empty_transitions and end_of_input_transitions; we should document that "empty" means epsilon.)

and these in DFAs:

pub struct State<T, A> {
    char_transitions: Map<char, T>,
    range_transitions: RangeMap<T>,
    any_transition: Option<T>,
    end_of_input_transition: Option<T>,
    ...
}

In both of these, we could merge char and range transitions in a single RangeMap field. When the transition happens on a character the range will include a single character. This has a few advantages:

  • It simplifies things and adds no new or special cases to consider, as range transitions can already have just one character today.

    • I think this case may be handled poorly in the code generator today, e.g. maybe we're generating guards like x >= 'a' && x <= 'a'. We should check this.
  • When we implement DFA minimization (#38) we will have to iterate all inputs of a partition (i.e. set of states). There we will have to merge (or actually, take difference of) characters and ranges. For example, in a partition, if a state has 'a' as input, another has ['a'-'z'] as input, we will have to test the inputs 'a' and ['b'-'z'] on the partition. I don't know how to implement this yet, but clearly we need to consider range-range overlaps and also range-char overlaps. Removing range vs. char distinction means less cases to handle.

    Actually, all states in a partition need to handle the same range, otherwise they can be distinguished and so should be moved to separate partitions. So we will have [RangeMap] (one for each state in the partition), split the partition into new partitions where for every state in a new partition, the RangeMaps have the same domain. Then we can use any of the RangeMaps to make sure they map same inputs to same partitions. Having just RangeMaps simplifies this.

Disadvantage is that for char transitions we will need heap allocation for the Vec<Range>. Two ways to avoid that:

  • Use SmallVec<[Range; 1]>
  • Make RangeMap an enum, with a variant for single-character ranges. With this we will still need to treat single-char ranges differently, but the handling will be encapsulated in RangeMap module. Outside of RangeMap we will only have ranges. (except in the codegen we will need to ask if a range is single-char to generate simpler conditionals)
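
A minimal sketch of the merged representation (illustrative types, not lexgen's actual internals), where a char transition is just a degenerate one-character range:

struct Range<T> {
    start: char,
    end: char, // inclusive
    value: T,
}

struct RangeMap<T> {
    ranges: Vec<Range<T>>,
}

impl<T> RangeMap<T> {
    // A single character becomes the range c..=c.
    fn insert_char(&mut self, c: char, value: T) {
        self.ranges.push(Range { start: c, end: c, value });
    }

    fn insert_range(&mut self, start: char, end: char, value: T) {
        self.ranges.push(Range { start, end, value });
    }
}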

Lookahead could be useful

Suppose I'm writing a Rust lexer. Naively, lexing integer literals could be implemented like this:

$dec_digit ($dec_digit | '_')* $int_suffix? => ...,

"0b" ($bin_digit | '_')* $bin_digit ($bin_digit | '_')* $int_suffix? => ...,

"0o" ($oct_digit | '_')* $oct_digit ($oct_digit | '_')* $int_suffix? => ...,

"0x" ($hex_digit | '_')* $hex_digit ($hex_digit | '_')* $int_suffix? => ...,

The problem with these rules is they generate multiple tokens for invalid literals like 0invalidSuffix or 123AFB43.

Currently the way to fix this is by introducing new rules for each of these literals:

rule Init {
    ...

    $dec_digit => |lexer| lexer.switch(LexerRule::DecInt),

    "0b" => |lexer| lexer.switch(LexerRule::BinInt),

    "0o" => |lexer| lexer.switch(LexerRule::OctInt),

    "0x" => |lexer| lexer.switch(LexerRule::HexInt),
}

rule DecInt {
    ($dec_digit | '_')* $int_suffix?,

    $ => ... return token ...,

    $whitespace => ... return token ...,
}

rule BinInt { ... }

rule OctInt { ... }

rule HexInt { ... }

What these rules do is when we see a character for an integer literal, we consume every character until the end of the input or we see a whitespace. Seeing an invalid character for the literal becomes an error.

If we had a regex for "followed by", we could use the rules in the first version, with a "followed by whitespace or end-of-input" part:

$dec_digit ($dec_digit | '_')* $int_suffix? > ($whitespace | $) => ...,

where > is the "followed by" syntax.

The difference from concatenation is the "followed by" part should not be included in the match.

Overlapping ranges break lexing

Here's the test:

lexer! {
    Lexer -> usize;

    ['a'-'b'] '1' = 1,
    ['a'-'c'] '2' = 2,
}

let mut lexer = Lexer::new("a1");
assert_eq!(ignore_pos(lexer.next()), Some(Ok(1)));
assert_eq!(lexer.next(), None);

let mut lexer = Lexer::new("a2");
assert_eq!(ignore_pos(lexer.next()), Some(Ok(2)));
assert_eq!(lexer.next(), None);

Lexing a1 works fine but we can't lex a2. I don't remember the code for adding ranges now and I didn't check yet, but I think the bug is in the handling of ranges: we implicitly assume character ranges don't overlap. In the example above, when adding the range 'a'-'c', we need to consider existing ranges at the same state ('a'-'b'), and make sure that the overlapping parts of the ranges will non-deterministically switch to the states of all of the overlapping ranges. So when 'a'-'b' matches we need to go to states for both '1' and '2'.

Provide a way to match end-of-input in non-initial states

(See also #12 for another different behavior in initial and non-initial states)

Suppose I want to lex C-style single line comments: // ...

lexer! {
    Lexer -> &'input str;

    rule Init {
        "//" => |lexer| {
            lexer.switch(LexerRule::SingleLineComment)
        },
    }

    rule SingleLineComment {
        '\n' => |lexer| {
            let comment = lexer.match_();
            lexer.switch_and_return(LexerRule::Init, comment)
        },

        _,
    }
}

This won't lex EOF-terminated comments because _ does not match EOF:

#[cfg(test)]
fn ignore_pos<A, E>(ret: Option<Result<(usize, A, usize), E>>) -> Option<Result<A, E>> {
    ret.map(|res| res.map(|(_, a, _)| a))
}

#[test]
fn comment() {
    let input = "// test";
    let mut lexer = Lexer::new(input);
    assert_eq!(ignore_pos(lexer.next()), Some(Ok(input))); // fails
    assert_eq!(ignore_pos(lexer.next()), None);
}

I don't know if we should make _ match EOF, or have another symbol for matching EOF explicitly.

Consider generating a warning (with an option to suppress) for rules that accept empty string

Recently I debugged a lexer with this rule:

("0b" | "0o" | "0x")? ($digit | '_')* $id? = ...,

This regex accepts the empty string, so the lexgen state never failed and instead looped when it was supposed to fail.

Now, I've never needed this in a lexer, but this kind of rule can be used to switch to another state on failure, so I guess it's not completely useless. However I suspect for the majority of the use cases it actually means a bug in the lexer, so perhaps it makes sense to generate a warning for rules that accept empty strings, with a pragma to explicitly suppress the warning.

Provide a way to fail with "lexer error"

Currently there's no way to fail with lexgen's "lexer error" when an error is detected. We need to specify a user error type, and fail using a value of that type.

It could be useful to provide a fail method in the lexers to fail with the standard "lexer error" error, without having to define an error type.

Use case: Rust string literals do not allow standalone carriage return, i.e. \r needs to be followed by \n.

One way to implement it is this:

lexer! {
    rule Init {
        ...

        '"' => |lexer| lexer.swtich(LexerRule::String),
    }

    rule String {
        ...

        '\r' => |lexer| lexer.switch(LexerRule::StringCR),

        _,
    }

    rule StringCR {
        '\n' => |lexer| lexer.switch(LexerRule::String),
    }
}

If we had a fail method on lexer structs it could be slightly more concise:

lexer! {
    rule Init {
        ...

        '"' => |lexer| lexer.swtich(LexerRule::String),
    }

    rule String {
        ...

        "\r\n",

        '\r' => |lexer| lexer.fail(),

        _,
    }
}

We don't need another rule to make sure \r is always followed by \n. We can already do this today, but we have to define an error type and return an error: (not tested)

lexer! {
    type Error = UserError;

    rule Init {
        ...

        '"' => |lexer| lexer.swtich(LexerRule::String),
    }

    rule String {
        ...

        "\r\n",

        '\r' =? |lexer| lexer.return_(Err(UserError { ... })),

        _,
    }
}

Implement a way to bind match results in patterns

Suppose I want to parse \xHH syntax in a rule (H is a hex digit). Currently it's quite painful to get the chars for those digits in a larger rule:

"\\x" $hex_digit $hex_digit => |mut lexer| {
    let match_ = lexer.match_();
    let bytes = match_.as_bytes();

    let digit1 = bytes[bytes.len() - 2];
    let digit2 = bytes[bytes.len() - 1];

    ...
},

The problem is that the match will contain the input since the beginning of the current match. So, for example, if this is in a string lexing rule and the string is "blah blah \xAA", then the match will be the entire string except the closing ".

One workaround is to add another rule for the hex digits and push them to a buffer in each match, but that's also verbose.

If we had a way to bind regexes inside patterns we could do:

"\\x" <d1:$hex_digit> <d2:$hex_digit> => |mut lexer| {
    ...
},

where d1 and d2 would be chars.

Compile DFAs as lookup tables instead of linear if-elses

It's not uncommon for a lexer state to be compiled to dozens of states. Currently we generate code like this:

match self.state {
    0 => ...,
    1 => ...,
    ...
    _ => ...,
}

For N states this means N-1 comparisons in the worst case to take the last branch.

Instead we should compile each state to a function, then put them all in an array:

static STATES: [fn(...) -> ...; N] = [...];

Then just do STATES[self.state](...);.
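
A minimal, self-contained sketch of the table-dispatch idea (illustrative types, not lexgen's generated code): each state becomes a function, and the driver indexes into a static array instead of matching on self.state.

struct LexerState {
    state: usize,
    // ... match bounds, user state, etc.
}

// One function per DFA state.
fn state_0(lexer: &mut LexerState) {
    lexer.state = 1;
}

fn state_1(lexer: &mut LexerState) {
    lexer.state = 0;
}

// The dispatch table.
static STATES: [fn(&mut LexerState); 2] = [state_0, state_1];

fn step(lexer: &mut LexerState) {
    STATES[lexer.state](lexer);
}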

Allow initializing lexers with a character iterator

Currently in main branch and the versions released on crates.io, generated lexers need to be initialized with a &str argument for the input.

It's also useful to initialize the lexers with an Iterator<Item = char> + Clone argument, as I do in the from_iter branch (used in my unannounced text editor project). The problem is that without a &'input str we can't return a &'input str for the current match.

Instead what we can return in match_ is the byte and character bounds of the current match. The user can then extract a string from the input if they have it as a &str.

Another alternative could be to maintain a String buffer in generated lexers, and return a slice to it (possibly with character and byte bounds of the match). The function will look like fn match_(&self) -> (&str, ... char and byte bounds ...). Users can then clone the string if they need to store it, or ignore it.

I think the first alternative (don't return a slice or string, just byte and char bounds) may be good enough. If the user has the input as a string, they can use the byte indices returned by match_ to get the slice of the match. So it's possible to get the current implementation using a lexer that takes Iterator<Item = char> + Clone as the input stream.

Invalid token doesn't consume input from match_ in non-initial rule

This works as I expected

lexer! {
    Lexer -> &'input str;

    rule Init {
        's' = "s",
        "a" => |lexer| lexer.return_(lexer.match_()),
    }
}

let input = "sxa";
let mut lexer = Lexer::new(input);
assert_eq!(lexer.next(), Some(Ok((loc(0, 0, 0), "s", loc(0, 1, 1)))));
assert_eq!(lexer.next(), Some(Err(LexerError { location: loc(0, 1, 1), kind: crate::LexerErrorKind::InvalidToken })));
assert_eq!(lexer.next(), Some(Ok((loc(0, 2, 2), "a", loc(0, 3, 3)))));
assert_eq!(lexer.next(), None);

though if I move "a" to a different rule and have "s" switch into that rule, it starts behaving differently, while I expected it to return exactly the same results as above

lexer! {
    Lexer -> &'input str;

    rule Init {
        's' => |lexer| lexer.switch_and_return(LexerRule::InString, "s"),
    }
    rule InString {
        "a" => |lexer| lexer.return_(lexer.match_()),
    }
}

let input = "sxa";
let mut lexer = Lexer::new(input);
assert_eq!(lexer.next(), Some(Ok((loc(0, 0, 0), "s", loc(0, 1, 1)))));
assert_eq!(lexer.next(), Some(Err(LexerError { location: loc(0, 1, 1), kind: crate::LexerErrorKind::InvalidToken })));
// Why "x" is included here???
assert_eq!(lexer.next(), Some(Ok((loc(0, 1, 1), "xa", loc(0, 3, 3)))));
// Perhaps this is because we don't switch back to the Init rule?
assert_eq!(lexer.next(), Some(Err(LexerError { location: loc(0, 3, 3), kind: crate::LexerErrorKind::InvalidToken })));
assert_eq!(lexer.next(), None);

It'd be great if

  • InvalidToken always consumed the invalid part, and
  • if the last LexerError is expected, have it documented in README.md (and perhaps return a different error kind than InvalidToken?)

Wrong lexing when rules share prefix

use lexgen::lexer;
use lexgen_util::Loc;

lexer! {
    Lexer -> usize;

    "'" _ "'" = 1,
    "'" ['a'-'z']+ = 2,
}

#[test]
fn char_lit() {
    let input = "'a'";
    let mut lexer = Lexer::new(input);
    // Fails:
    assert_eq!(lexer.next(), Some(Ok((Loc { line: 0, col: 0, byte_idx: 0 }, 1, Loc { line: 0, col: 3, byte_idx: 3 }))));
    assert_eq!(lexer.next(), None);
}

Generate DFA directly (no NFA)

If we generate the DFA directly without going through NFA:

  • Should be more efficient
  • Should generate a slightly better DFA

Empty rule fails or loops depending on other rules

use lexgen::lexer;

lexer! {
    Lexer -> &'input str;

    let whitespace =
        ['\t' '\n' '\u{B}' '\u{C}' '\r' ' ' '\u{85}' '\u{200E}' '\u{200F}' '\u{2028}' '\u{2029}'];

    let dec_digit = ['0'-'9'];

    let hex_digit = ['0'-'9' 'a'-'f' 'A'-'F'];

    let int_suffix = ('u' | 'i') '8' | ('u' | 'i') "16" | ('u' | 'i') "32" |
            ('u' | 'i') "64" | ('u' | 'i') "128" | ('u' | 'i') "size";

    rule Init {
        "0x" => |lexer| lexer.switch(LexerRule::HexInt),

        $dec_digit => |lexer| lexer.switch(LexerRule::DecInt),
    }

    rule DecInt {
        ($dec_digit | '_')* $int_suffix?,

        $ => |lexer| {
            let match_ = lexer.match_();
            lexer.return_(match_)
        },

        $whitespace => |lexer| {
            let match_ = lexer.match_();
            lexer.return_(&match_[..match_.len() - match_.chars().last().unwrap().len_utf8()])
        },
    }

    rule HexInt {

    }
}

fn ignore_loc<A, E, L>(ret: Option<Result<(L, A, L), E>>) -> Option<Result<A, E>> {
    ret.map(|res| res.map(|(_, a, _)| a))
}

fn main() {
    let mut lexer = Lexer::new("0xff");
    assert_eq!(ignore_loc(lexer.next()), Some(Ok("0xff")));
}

The repro above loops, which is a bug. An empty rule should always fail.

Interestingly, if I remove the contents of the rule DecInt, it starts to fail, but I think the error value is not right:

thread 'main' panicked at 'assertion failed: `(left == right)`
  left: `Some(Err(InvalidToken { location: Loc { line: 0, col: 1, byte_idx: 1 } }))`,
 right: `Some(Ok("0xff"))`', src/main.rs:37:5

I would expect the error location to be column 2, as the 0x part should be consumed by the rule starting with "0x".

Question about unicode categories

I need to match a token which contains unicode scalar values in the categories L, M, N, P, S and Cf.

I see three different ways to solve this:

  1. I wrote a small tool to generate the character set (about 15 kB of text) and include it
  2. I use a simpler character set and verify with a regex in the semantic action
  3. I hack lexgen

@osa1, what do you think?

Find a better metavariable syntax in documentation

In the regex section we use <regex> syntax for metavariable "regex", but that syntax does not look great in the next section when we show Rust code. For example, we say things like:

If lexer is declared as Lexer -> Token;
...
LexerAction<Token>

Ideally we should indicate that Lexer in LexerAction is the user-specified lexer name, and Token is the return type (RHS of the arrow).

Some options:

  • Keep using <...>: <Lexer>Action<<Token>>. I think double angle brackets around Token don't look good.
  • Use $(...): $(Lexer)Action<$(Token)>. I like this better, because it looks somewhat similar to Rust macro syntax for splicing. (though you can't generate new symbols using this syntax in Rust, I think).
  • Other ideas?

Windows-style line endings cause panic on Lua 5.1 example

Haven't experimented much with the core package yet, but right off the bat I noticed that the Lua example works fine with Unix-style line endings; however, with Windows-style line endings it panics.

Return multiple tokens

Sometimes I want a lexer rule to be able to return multiple tokens, e.g. to emit a dummy token so the parser can use it as an end-marker for some syntax. Maybe I should just use Lexer -> Vec<MyToken> and flatten it later, though it'd be great if this were supported on the library side.
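
A sketch of the flattening workaround mentioned above, for some input: &str, assuming a lexer declared as Lexer -> Vec<MyToken> and that Loc is Copy:

let lexer = Lexer::new(input);
let tokens = lexer.flat_map(|result| match result {
    Ok((start, toks, end)) => toks
        .into_iter()
        .map(|t| Ok((start, t, end)))
        .collect::<Vec<_>>(),
    Err(err) => vec![Err(err)],
});

for tok in tokens {
    // Each item is now a single (Loc, MyToken, Loc) triple or an error.
    println!("{:?}", tok);
}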

Avoid redundant backtrack state updates

In the Lua lexer I see code like

'>' => {
    self.0.set_accepting_state(Lexer_ACTION_13);          // 2
    match self.0.next() {
        None => {
            self.0.__done = true;
            match self.0.backtrack() {                    // 6
                ...
            }
        }
        Some(char) => match char {
            '>' => {
                self.0.reset_accepting_state();           // 12
                match Lexer_ACTION_31(self) {
                    ...
                }
            }
            '=' => {
                self.0.reset_accepting_state();           // 18
                match Lexer_ACTION_11(self) {
                    ...
                }
            }
            _ => match self.0.backtrack() {               // 23
                ...
            },
        },
    }
}

Here we don't need to set accepting state in line 2. Lines 6 and 23 call the semantic action set in line 2, and could call the function directly instead of going through backtrack. Lines 12 and 18 ignore accepting state entirely and call other semantic action functions.

I think if we move set_accepting_state calls to the branches it may become easier in the code generator to handle these cases. In that case the code above would look like:

'>' => {
    match self.0.next() {
        None => {
            self.0.set_accepting_state(Lexer_ACTION_13);       
            self.0.__done = true;
            match self.0.backtrack() {                
                ...
            }
        }
        Some(char) => match char {
            '>' => {
                self.0.set_accepting_state(Lexer_ACTION_13);       
                self.0.reset_accepting_state();      
                match Lexer_ACTION_31(self) {
                    ...
                }
            }
            '=' => {
                self.0.set_accepting_state(Lexer_ACTION_13);       
                self.0.reset_accepting_state();     
                match Lexer_ACTION_11(self) {
                    ...
                }
            }
            _ => {
                self.0.set_accepting_state(Lexer_ACTION_13);       
                match self.0.backtrack() {        
                    ...
                }
            }
        },
    }
}

Now in generate_state_arm, we can avoid generating set_accepting_state if we backtrack or call another semantic action function. We also need to use peek instead of next.

Consider moving semantic actions to functions, leave inlining decisions to LLVM/rustc

With #7 we now generate much more efficient code, but in principle it could lead to code size explosion when some semantic actions have large code and lots of incoming edges in the DFA. I didn't see this happen in practice yet, but I think it could happen.

If we introduce a top-level function for semantic actions, then duplicated code will just be a call expression to the semantic action function. Inlining the semantic action code is then left to LLVM/rustc. I think this would give us more or less the same result for small semantic actions and avoid code size blowup in the worst case.

continue_ doesn't consume input in non-initial rule

In the following example, there's a pattern in the Secondary rule intended to skip whitespace. But the whitespace ends up part of the following match.

use lexgen::lexer;

#[derive(Debug, PartialEq)]
enum Token<'input> {
    Keyword,
    Identifier(&'input str),
}

lexer! {
    Lexer -> Token<'input>;

    rule Init {
        "do" => |lexer| {
            lexer.switch_and_return(LexerRule::Secondary, Token::Keyword)
        },
    },

    rule Secondary {
        $$ascii_whitespace,
        $$ascii_alphanumeric+ => |lexer| {
            lexer.return_(Token::Identifier(lexer.match_()))
        },
    },
}

#[test]
fn test() {
    let mut lexer = Lexer::new("do thing");
    assert_eq!(lexer.next().unwrap().unwrap().1, Token::Keyword);
    assert_eq!(lexer.next().unwrap().unwrap().1, Token::Identifier("thing"));
}

Expected result: Test passes

Actual result:

assertion failed: `(left == right)`
  left: `Identifier(" thing")`,
 right: `Identifier("thing")`

Note: This only happens in a non-initial rule. If I redefine the lexer as follows, the test will pass:

lexer! {
    Lexer -> Token<'input>;

    rule Init {
        $$ascii_whitespace,
        "do" = Token::Keyword,
        $$ascii_alphanumeric+ => |lexer| {
            lexer.return_(Token::Identifier(lexer.match_()))
        },
    },
}

Am I misunderstanding something or is this a bug?

Implement fast paths for ASCII in unicode regexes

Some of the search tables for built-in unicode regular expressions are quite large, but I think 99.9999% of the time they will match ASCII characters, so we should implement a fast path for that. For example, XID_Start currently makes 9 range comparisons to match a.
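
A minimal sketch of the fast-path idea (illustrative code, not lexgen's generated output): check the ASCII range with a cheap comparison and only fall back to the binary-search table for non-ASCII input. For XID_Start the ASCII subset is exactly the ASCII letters.

use std::cmp::Ordering;

// `table` holds sorted, non-overlapping, inclusive ranges of the non-ASCII
// XID_Start characters.
fn is_xid_start(c: char, table: &[(char, char)]) -> bool {
    if c.is_ascii() {
        // Fast path: the ASCII subset of XID_Start is 'a'..='z' and 'A'..='Z'.
        return c.is_ascii_alphabetic();
    }
    table
        .binary_search_by(|&(lo, hi)| {
            if c < lo {
                Ordering::Greater
            } else if c > hi {
                Ordering::Less
            } else {
                Ordering::Equal
            }
        })
        .is_ok()
}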

Rule templates?

Here are rules I'm using to lex Rust decimal, binary, octal, and hexadecimal numbers:

rule DecInt {
    $dec_digit,
    '_',

    $int_suffix | $ => |lexer| {
        let match_ = lexer.match_();
        lexer.switch_and_return(LexerRule::Init, Token::Lit(Lit::Int(match_)))
    },

    $whitespace => |lexer| {
        let match_ = lexer.match_();
        // TODO: Rust whitespace characters 1, 2, or 3 bytes long
        lexer.switch_and_return(
            LexerRule::Init,
            Token::Lit(Lit::Int(&match_[..match_.len() - match_.chars().last().unwrap().len_utf8()]))
        )
    },
}

rule BinInt {
    $bin_digit,
    '_',

    $int_suffix | $ => |lexer| {
        let match_ = lexer.match_();
        lexer.switch_and_return(LexerRule::Init, Token::Lit(Lit::Int(match_)))
    },

    $whitespace => |lexer| {
        let match_ = lexer.match_();
        // TODO: Rust whitespace characters 1, 2, or 3 bytes long
        lexer.switch_and_return(
            LexerRule::Init,
            Token::Lit(Lit::Int(&match_[..match_.len() - match_.chars().last().unwrap().len_utf8()]))
        )
    },
}

rule OctInt {
    $oct_digit,
    '_',

    $int_suffix | $ => |lexer| {
        let match_ = lexer.match_();
        lexer.switch_and_return(LexerRule::Init, Token::Lit(Lit::Int(match_)))
    },

    $whitespace => |lexer| {
        let match_ = lexer.match_();
        // TODO: Rust whitespace characters 1, 2, or 3 bytes long
        lexer.switch_and_return(
            LexerRule::Init,
            Token::Lit(Lit::Int(&match_[..match_.len() - match_.chars().last().unwrap().len_utf8()]))
        )
    },
}

rule HexInt {
    $hex_digit,
    '_',

    $int_suffix | $ => |lexer| {
        let match_ = lexer.match_();
        lexer.switch_and_return(LexerRule::Init, Token::Lit(Lit::Int(match_)))
    },

    $whitespace => |lexer| {
        let match_ = lexer.match_();
        // TODO: Rust whitespace characters 1, 2, or 3 bytes long
        lexer.switch_and_return(
            LexerRule::Init,
            Token::Lit(Lit::Int(&match_[..match_.len() - match_.chars().last().unwrap().len_utf8()]))
        )
    },
}

These rules are all the same, except the "digit" part: for binary numbers I'm using $bin_digit regex for the digits, for hex I'm using $hex_digit, and similar for other rules.

If we could implement "rule templates" that take regexes as arguments, we could have one template with a "digit" parameter, and pass $hex_digit, $oct_digit, etc. to it and avoid the duplication.

Provide a way to exclude characters

I can only find hypothetical cases for this feature, but maybe there are some real-world use cases too, so opening an issue to document those cases and implement the feature if needed.

Suppose I have a syntax like this: X can be followed by any character, except Y. Currently I need to do something like this: [0..Y-1] | [Y+1..]. This is not going to scale when I exclude multiple characters.

Note that a rule like this does not work as expected:

lexer! {
    ...

    rule Init {
        ...
        X => |lexer| lexer.switch(LexerRule::MyRule),
    }

    rule MyRule {
        Y =? |lexer| { ... raise an error ... },
        _,
    }
}

This fails when it sees XY, instead of yielding lexeme X first, and then lexing rest of the string starting with Y.

Currently the only way to implement such rules is by explicitly providing accepted character ranges.

In practice I haven't encountered this case yet. In cases like Rust string literals where standalone \r is not allowed, it is an error to see disallowed characters, so the rule example above is exactly what we need. I don't know any cases where an unexpected character is not an error but a new lexeme.

Allow let bindings interleaved with rules

Currently let bindings need to come before rules. This is inconvenient in large lexers. For example, I'm working on a Rust lexer, and I currently have these let bindings:

    // https://doc.rust-lang.org/reference/whitespace.html
    let whitespace =
        ['\t' '\n' '\u{B}' '\u{C}' '\r' ' ' '\u{85}' '\u{200E}' '\u{200F}' '\u{2028}' '\u{2029}'];

    let oct_digit = ['0'-'7'];
    let dec_digit = ['0'-'9'];
    let hex_digit = ['0'-'9' 'a'-'f' 'A'-'F'];
    let bin_digit = '0' | '1';
    let int_suffix = ('u' | 'i') '8' | ('u' | 'i') "16" | ('u' | 'i') "32" |
            ('u' | 'i') "64" | ('u' | 'i') "128" | ('u' | 'i') "size";

There will be more in the final version. Ideally I shouldn't have to declare regexes specific to parsing numbers before everything else. I should be able to declare common regexes for numbers right before the rules for lexing numbers.

Implement an easy way to get the matched char in a wildcard rule

Currently getting the matched character in a wildcard is quite verbose (and probably also inefficient):

_ => |mut lexer| {
    let char = lexer.match_().chars().next_back().unwrap();
    ...
}

One easy fix would be to add a char method to lexer that returns the last matched character.

Alternatively with #9 we could allow <char:_> => ... syntax.

Provide a way to specify error value when failing to match

This is related to #35 and we use the same example.

Suppose in b"\xa" I want to fail with "invalid hex escape".

With a "cut" operator as described in #35 the best we can have in a concise way is an "invalid token" error.

To raise an "invalid hex escape" error we need to use new rules. For example:

rule ByteString {
    "\\x" => |lexer| lexer.switch(LexerRule::ByteStringHexEscape),

    ($ascii_for_string | $byte_escape | $string_continue | "\r\n")* '"' => |lexer| {
        let match_ = lexer.match_();
        lexer.switch_and_return(LexerRule::Init, Token::Lit(Lit::ByteString(match_)))
    },
}

rule ByteStringHexEscape {
    $hex_digit $hex_digit => |lexer| lexer.switch(LexerRule::ByteString),
    $ | _ | _ _ =? |lexer| lexer.return_(Err(CustomError::InvalidHexEscape)),
}

The new rule ByteStringHexEscape matches two hex digits, and fails with InvalidHexEscape on everything else. Note that we don't want to match more than two characters here, so we have cases for end-of-stream ($), one character (_) and two characters (_ _). We can't do something like _* because that would match \xaaaa and fail with InvalidHexEscape.

(This is a case where a syntax for matching between given numbers of occurrences would be useful, e.g. _{0,2} would expand to $ | _ | _ _. alex has this feature.)

It would be good to have a more concise way of failing with a given error. For example, in the definition of byte_escape:

let byte_escape = ("\\x" $hex_digit $hex_digit) | "\\n" | "\\r" | "\\t" | "\\\\" | "\\0" | "\\\"" | "\\'";

Maybe we could have something like:

let byte_escape = ("\\x" !CustomError::InvalidHexEscape $hex_digit $hex_digit)
            | "\\n" | "\\r" | "\\t" | "\\\\" | "\\0" | "\\\"" | "\\'";

where ! is the cut operator as described in #35, but when the match fails, instead of InvalidToken we now raise InvalidHexEscape.

One question is whether we also want a syntax for specifying the error value, without also adding a "cut". So far I didn't need this.

Implement "cut" (as in Prolog) for comitting to a choice

Suppose I'm trying to lex this invalid Rust code: b"\xa". The problem here is \x needs to be followed by two hex digits, not one.

If I run this with rustc I get an "invalid escape" error, as expected.

If I run this with lexgen_rust, I get an id b first, then an error.

The problem is with backtracking. The lexgen-generated lexer records the successful match for b as an identifier and continues lexing, to be able to return the longest match. When it fails to match the rest of the token, it returns b as an identifier.

Instead what we want to do is, when we see b" we want to "commit" to the byte string rule, i.e. no backtracking from that point. If the rest of the token is not a valid byte string then we don't return b as an id and fail.

This is trivial to implement once we come up with a syntax: just reset the last_match when we make a transition with a "cut" (or "commit") annotation.

Currently the workaround is to have a lexer state for lexing the string body. So instead of this:

rule Init {
    ...

    "b\"" ($ascii_for_string | $byte_escape | $string_continue | "\r\n")* '"' => |lexer| {
        let match_ = lexer.match_();
        lexer.return_(Token::Lit(Lit::ByteString(match_)))
    },
}

We need something like:

rule Init {
    ...

    "b\"" => |lexer| lexer.switch(LexerRule::ByteString),
}

rule ByteString {
    ($ascii_for_string | $byte_escape | $string_continue | "\r\n")* '"' => |lexer| {
        let match_ = lexer.match_();
        lexer.switch_and_return(LexerRule::Init, Token::Lit(Lit::ByteString(match_)))
    },
}

Since the idea is similar to Prolog's "cut", I suggest a similar syntax:

rule Init {
    ...

    "b\"" ! ($ascii_for_string | $byte_escape | $string_continue | "\r\n")* '"' => |lexer| {
        let match_ = lexer.match_();
        lexer.return_(Token::Lit(Lit::ByteString(match_)))
    },
}

That ! above is "cut" (or "commit"), meaning once b" is matched there is no backtracking, we either match rest of the string according to the current rule, or fail with an error pointing to the character b.

I wonder if other lexer generators have a syntax for this kind of thing?
