tokay-lang / tokay Goto Github PK

View Code? Open in Web Editor NEW

232.0 4.0 7.0 1.58 MB

Tokay is a programming language designed for ad-hoc parsing, inspired by awk.

Home Page: https://tokay.dev

License: MIT License

Rust 99.84% Makefile 0.10% Awk 0.06%

programming-language compiler awk ad-hoc parsing packrat language

tokay's Introduction

Tokay

Tokay is a programming language designed for ad-hoc parsing.

Tokay is under development and not considered for production use yet; Be part of Tokay's ongoing development, and contribute!

About

Tokay is a programming language to quickly implement solutions for text processing problems. This can either be just simple data extractions, but also parsing entire structures or parts of it, and turning information into structured parse trees or abstract syntax trees for further processing.

Therefore, Tokay is both a tool and language for simple one-liners, but can also be used to implement code-analyzers, refactoring tools, interpreters, compilers or transpilers. Actually Tokay's own language parser is implemented in Tokay itself.

Tokay is inspired by awk, has syntactic and semantic flavours of Python and Rust, but also follows its own philosophy, ideas and design principles. Thus, it isn't compareable to other languages or projects, and is a language on its own.

Tokay is still a very young project and gains much potential. Volunteers are welcome!

Highlights

Interpreted, procedural and imperative scripting language
Concise and easy to learn syntax and object system
Stream-based input processing
Automatic parse tree construction and synthesis
Left-recursive parsing structures ("parselets") supported
Implements a memoizing packrat parsing algorithm internally
Robust and fast, as it is written entirely in safe Rust
Enabling awk-style one-liners in combination with other tools
Generic parselets and functions
Import system to create modularized programs (*coming soon)
Embedded interoperability with other programs (*coming soon)

Installation

By using Rusts dependency manager and build-tool cargo, simply install Tokay with

$ cargo install tokay

For Arch Linux-based distros, there is also a tokay and tokay-git package in the Arch Linux AUR.

Examples

Tokay's version of "Hello World" is quite obvious.

print("Hello World")

$ tokay 'print("Hello World")'
Hello World

Tokay can also greet any wor(l)ds that are being fed to it. The next program prints "Hello Venus", "Hello Earth" or "Hello" followed by any other name previously parsed by the builtin Word-token. Any other input than a word is automatically omitted.

print("Hello", Word)

$ tokay 'print("Hello", Word)' -- "World 1337 Venus Mars 42 Max"
Hello World
Hello Venus
Hello Mars
Hello Max

A simple program for counting words which exists of a least three characters and printing a total can be implemented like this:

Word(min=3) words += 1
end print(words)

$ tokay "Word(min=3) words += 1; end print(words)" -- "this is just the 1st stage of 42.5 or .1 others"
5

The next, extended version of the program from above counts all words and numbers.

Word words += 1
Number numbers += 1
end print(words || 0, "words,", numbers || 0, "numbers")

$ tokay 'Word words += 1; Number numbers += 1; end print(words || 0, "words,", numbers || 0, "numbers")' -- "this is just the 1st stage of 42.5 or .1 others"
9 words, 3 numbers

By design, Tokay constructs syntax trees from consumed information automatically.

The next program directly implements a parser and interpreter for simple mathematical expressions, like 1 + 2 + 3 or 7 * (8 + 2) / 5. The result of each expression is printed afterwards. Processing direct and indirect left-recursions without ending in infinite loops is one of Tokay's core features.

_ : Char< \t>+            # redefine whitespace to just tab and space

Factor : @{
    Int _                 # built-in 64-bit signed integer token
    '(' _ Expr ')' _
}

Term : @{
    Term '*' _ Factor     $1 * $4
    Term '/' _ Factor     $1 / $4
    Factor
}

Expr : @{
    Expr '+' _ Term       $1 + $4
    Expr '-' _ Term       $1 - $4
    Term
}

Expr _ print("= " + $1)   # gives some neat result output

$ tokay examples/expr_from_readme.tok
1 + 2 + 3
= 6
7 * (8 + 2) / 5
= 14
7*(3-9)
= -42
...

Tokay can also be used for programs without any parsing features.
Next is a recursive attempt for calculating the factorial of an integer.

factorial : @x {
    if !x return 1
    x * factorial(x - 1)
}

factorial(4)

$ tokay examples/factorial.tok
24

And this version of above program calculates the factorial for any integer token matches from the input. Just the invocation is different, and uses the Number token.

factorial : @x {
    if !x return 1
    x * factorial(x - 1)
}

print(factorial(int(Number)))

$ tokay examples/factorial2.tok -- "5 6 ignored 7 other 14 yeah"
120
720
5040
87178291200
$ tokay examples/factorial2.tok
5
120
6
720
ignored 7
5040
other 14
87178291200
...

Documentation

The Tokay homepage tokay.dev provides links to a quick start and documentation. The documentation source code is maintained in a separate repository.

Logo

The Tokay programming language is named after the Tokay gecko (Gekko gecko) from Asia, shouting out "token" in the night.

The Tokay logo and icon was thankfully designed by Timmytiefkuehl.
Check out the tokay-artwork repository for different versions of the logo as well.

License

Tokay is free software under the MIT license.
Please see the LICENSE file for details.

tokay's People

Contributors

Stargazers

Watchers

Forkers

phorward nivpgir leticia-maria ebell495 jgbyrne derekirish3 mayhemheroes

tokay's Issues

Parselets should allow for `*args` and `**nargs` catchall

This issue takes the *args and **kwargs attempt from Python. Yeah, this would be nice to have :-D

func : @a, b=2, *args, **nargs {
    print(a)
    print(b)
    print(args)
    print(nargs)
}

>>> func(1)
1
2
()
()
>>> func(1,2,3)
1
2
(3,)
()
>>> func(a=3, b=5, c=10)
3
5
()
(c => 10)

Use of `accept`, `reject` and `repeat` inside of a non-consuming parselet should return an error

Currently, this is allowed:

f : @{ 1 repeat; 2 }

a non.-consuming function uses parselet-only keywords. In general, these are accept, reject and repeat. The repeathas no effect in this example, as it doesn't consume any input, and this causes repeat to stop. The usage of these keywords is absolutely effect-less.

`value!`-macro doesn't allow for negative integers in lists

This works:

value!(-1)

This fails:

value!([-1, 1])

with

error[E0277]: the trait bound `refvalue::RefValue: From<[{integer}; 2]>` is not satisfied
   --> src/value/mod.rs:63:9
    |
63  |         $crate::RefValue::from($value)
    |         ^^^^^^^^^^^^^^^^^^^^^^ the trait `From<[{integer}; 2]>` is not implemented for `refvalue::RefValue`
    |
   ::: src/value/refvalue.rs:457:17
    |
457 |         Ok(Some(value!([-1, 1])))
    |                 --------------- in this macro invocation
    |
    = help: the following implementations were found:
              <refvalue::RefValue as From<&'static Builtin>>
              <refvalue::RefValue as From<&str>>
              <refvalue::RefValue as From<Box<(dyn object::Object + 'static)>>>
              <refvalue::RefValue as From<Dict>>
            and 13 others
    = note: this error originates in the macro `value` (in Nightly builds, run with -Z macro-backtrace for more info)

It looks like this has to do with the AST node types allowed from the macro definition.

Calling builtin with too many parameters fails assertion

>>> ord(1 2)
thread 'main' panicked at 'assertion failed: args.is_empty()', src/builtin.rs:142:1
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

This is a wanted behavior, but also not wanted in user feedback.
The bug came in when the builtin call checker was substituted by the new generated builtins introduced with Tokay v0.5.

Builtin / default constants

With Tokay v0.5, the language is entering a phase where modularity becomes an important part.

For example, the new builtin native tokens Int and Float can be grouped by a pure Tokay parselet Number, defined as

Number : Float | Int

And together with #10, generic parselets are needed in various ways to provide tools for different use-cases.

Currently, there is no way to define such default parselets, except hacking directly in Compiler::get_constant() or Compiler::set_constant() as with the Whitespace defaults handling.

For sure, Float and Int could either be implemented in pure Tokay code on their own, but this may end in performance lose.

Ideas for this issue (partly from #10):

`1--2` produces misleading `Line 1, column 4: Expecting Lvalue`

Seems to be a grammar bug.

Tokay 0.4.0
>>> -2
-2
>>> 1--2
Line 1, column 4: Expecting Lvalue
>>> 1- -2
3

Expected behavior:

Tokay 0.4.0
>>> -2
-2
>>> 1--2
3
>>> 1- -2
3

ast() function looses correct overall offset.

The program

Integer ast("int")

executed on input hello1234or23so currently returns

(
   (col => 6, emit => "int", offset => 5, row => 1, "stop_col" => 10, "stop_offset" => 9, "stop_row" => 1, value => 1234),
   (col => 12, emit => "int", offset => 2, row => 1, "stop_col" => 14, "stop_offset" => 4, "stop_row" => 1, value => 23)
)

but due Tokay's main parselet reader reset, the offsets are incorrect. row and col values are correct as they are counted separately.

The correct return should be

(
   (col => 6, emit => "int", offset => 5, row => 1, "stop_col" => 10, "stop_offset" => 9, "stop_row" => 1, value => 1234),
   (col => 12, emit => "int", offset => 12, row => 1, "stop_col" => 14, "stop_offset" => 14, "stop_row" => 1, value => 23)
)

col and offset are equal in this example, but this is not the case when newlines are in the input.

Requirement for further expressional operators `|`, `&`, `^`, `**`, `//`

Discussion on further, eventually useful operators in expressions.

binary & (&=) and | |=
^ and ^= (xor)
** and **= (powers)
// and //= (integer division)

It should be discussed if these are really necessary.

Generic parselets

Parselets could be defined as generics when constant consumables are marked for being replaceable at a parselet's usage, which duplicates the parselet for the specific consumable.

Draft

The idea for generic parselets is to allow for replaceable constants defined with a special <...>-notation.
The constants are held and replaced during compile-time, and turned into specific VM instructions on demand of their usage.

Definition

Generic : @<P, Q: _, r: 0> { ... }'

Generic is a generic parselet with the replaceable constants P, Q and r, where P and Q are consumables, and Q has _-whitespace pre-defined, r has constant 0 pre-defined.

Usage

These are example calls.

Generic<Integer>
Generic<Integer, Q: ' '>
Generic<Q: '...', P: Integer, r: 10>

Builtin generics

Some generics should be available as built-ins, and could replace the current implementation eg of

`Until : @<P, Escape: Void>`

The Until-parselet could be a generic built-in parselet to parse data until a specific token or parselet occurs.

'"' Until<'"', Escape: '\\'> '"' parse strings like "Hello World" or with escape sequences like "Hello\nWorld"
Implement a String parselet as shortcut for above, like String: @<Start, End: Void, Escape: Void>
Until<Not<Char<A-Za-z_>]>> parse anything consisting not of Char<A-Za-z_>
Until<EOF> read all until EOF
#38 refers to a Line parselet for matching input lines

`Repeat : @ min=1, max=0`

This is a simple programmatic sequential repetition. For several reasons, repetitions can also be expressed on a specialized token-level or by the grammar itself using left- and right-recursive structures, resulting in left- or right-leaning parse trees.

Used by optional, positive and kleene modifiers.

Replacement for current repeat-construct.

`Not : @`

This parser runs its sub-parser and returns its negated result, so that an accept becomes
rejected and vice-versa.

Replacement for current not-construct.

`Peek : @`

This parselet runs P and returns its result, but resets the reading-context afterwards. It can be used to look ahead parsing constructs, but leaving the rest of the parser back to its original position, to decide.

Due to Tokays memorizing features, the parsing will only be done once, and is remembered.

Replacement for current peek-construct.

`Expect : @ msg=void`

This constructs expects P to be accepted. On failure, an error message is raised as Reject::Error.

Replacement for current expect-construct.

`List : @<P, Separator: ',', empty: true>`

Parse a separated list.

List : @<P, Separator: ',', empty: true> {
    Self Separator P
    if empty Self Separator   # allows for "a,b,c,"
    P
}

`Keyword : @`

Parses a keyword, which is a consumable not followed by any alphabetic letter.

Definition

Keyword : @<P> {
    P Peek<Not<Alphabetic>>
}

Example

Keyword<'if'>  # matches "if" in "if a...", "if(x)", "if." but not in "ifx"

User-defined generics

Example use-case is this grammar draft for Tokay's own REPL itself. It implements a generic Set parselet which can be used by different switches, allowing for feature enabling-/disabling like #debug, #debug on or #debug off.

# Grammar for REPL commands

_ : @{
    ' '
    '\t'
    '\n'
}

Switch : @<Ident> emit=void {
    if !emit {
        emit = str(Ident)
    }

    Ident _ 'on'            ast(emit, true)
    Ident _ 'off'           ast(emit, false)
    Ident                   ast(emit, true)
}


'#' {
    Switch<'debug'>
    Switch<'verbose'>
    Switch<'compiler-debug'>
    'run' _ Name            ast("run")
}

Source position not always available

Tokay 0.5.0
>>> l = list()
>>> l += 1
>>> l
list(1)
>>> l += 2
>>> l
(1, 2)
>>> l -= 1
Method 'list_sub' not found

Expected error message: Line 1, column 1: Method 'list_sub' not found

Inline-parselet used as function-parameter is not called

Current behavior:

$ tokay -- "Hello World"
Tokay 0.4.0
>>> str_upper(@{'Hello'})
("<PARSELET 94732966580648>", "<PARSELET 94732966580648>", "<PARSELET 94732966580648>", "<PARSELET 94732966580648>", "<PARSELET 94732966580648>", "<PARSELET 94732966580648>", "<PARSELET 94732966580648>", "<PARSELET 94732966580648>", "<PARSELET 94732966580648>", "<PARSELET 94732966580648>", "<PARSELET 94732966580648>", "<PARSELET 94732966580648>")
>>> str_upper(@{'Hello'}())
HELLO
>>>

Desired behavior:

$ tokay -- "Hello World"
Tokay 0.4.0
>>> str_upper(@{'Hello'})
HELLO
>>> str_upper(@{'Hello'}())
HELLO
>>>

Refers to #17 also.

Executing on windows requires manually fixing CRLF

Tokay programs fail to parse when trying to execute them on windows due to "expecting EOF".

I am able to work around this by manually converting "\r\n" to "\n" (using emacs but I assume dos2unix would work as well).

I have a hunch that fixing it would be rather simple (adding a case for "\r\n" somewhere around here and even tried to implement it myself, but failed to compile tokay, because of the same issue, after converting all *.tok files from the repo, I'm getting thread 'main' has overflowed its stack on every run of my locally built tokay (the one installed with cargo install tokay works fine) after which I stopped trying 😁.

any chance you can help/fix/support this?

Also, something else that I was wondering about (for a small parser I thought of implementing with tokay), is a string len function, it doesn't look like there's a builtin for it, and I'm not quite sure how to implement it with tokay itself. this is of course a different issue, but I didn't want to start spamming new issues 😅

Cool language! I hope it grows.

`print()` should accept for newline-parameter

print() should allow for a named parameter used to control the behavior if a newline is printed (default) or not after the output.

Input:

print("Hello World")
print("Hello", newline=false)
print("World")

Output:

Hello World
HelloWorld

*deref-Operator to avoid automatic calling values when they are directly callable

Currently, using a function in a sequence immediately calls it, if it doesn't own any required parameters, which is required for the need to avoid marking a function's use explicitly as a call.

>>> f : @{ print("Hello World") }
>>> f
Hello World
>>> g : @x{ print("Hello World2") }
>>> g
<parselet g>

To just access the function object in this case, a C-style deref-operator might be useful, like a preceding star *.

For clarification:

>>> f : @{ print("Hello World") }
>>> f
Hello World
>>> *f
<parselet f>

Empty `#` comments don't work

#Comment
#

123

isn't accepted: Line 2, column 1: Parse error, expecting end-of-file

Capture alias inferring, so `name => Name $n` $n shorthands $name?

This is only an idea.

When name => Name matches, the capture currently can only be accessed via $1 or $name. Alias inferring would also allow for $n, $na or $nam to match the capture $name. This might also be confusing, but also useful in some cases.

Modifiers for consumable repetition min-max

Likewise in regular expressions, .{2,4} matches any character either 2, 3 or 4 times, .{2} exactly 2 times, and .{,4} zero to 4 times.

Tokay could implement this as well for any consumable. A nice syntax should be found, as the regex-based syntax is not appropriate.

Here are some ideas how a syntax could look like. This requires that Tokay also supports some thing like slices.

'a'+2..4
'a'+2
'a'+..2

This syntax would be valid, as 'a'+2..4 is one syntactic unit, whereas 'a'+ 2..4 is a sequence of 'a' with unlimited positive modifier and a slice from 2..4.

Parser should use built-in tokens on appropriate positions

Tokay currently parses Integer and Float on its own inside parser.rs. It should be considered to use Tokay's own built-in tokens Integer and Float for that.

T_Float => Float
T_Integer => Integer

More might follow eventually.

Constants `Self` and/or `self` to self-reference current parselet

To refer the current parselet itself, as self keyword might be useful. This is also relevant for #10

@{ Self 'T' | 'T' }  # matches T TT TTT TTTTTTTTT...

Similar projects that can be of interest

I found some projects that have a similar objective as this one:

An experiment in language design, employing a dynamically exstensible LALR(1) parser. https://github.com/waywardgeek/not_rune
n addition to many other features it provides an extension mechanism. Seed7 supports the introduction of new syntax and their semantics into the language and it allows new language constructs to be defined using the Seed7 language itself. E.g.: Programmers can introduce syntax and semantic of new statements as well as user defined operator symbols. http://seed7.sourceforge.net/
Jeebox tries to fulfil the goal of describing everything https://github.com/gamblevore/jeebox
http://mbeddr.com/
A more descriptive text and a list of examples https://en.wikipedia.org/wiki/Extensible_programming

Proving an object for AST-nodes

Currently, AST nodes are being encoded as nested list-dict-objects and generated by the ast() built-in function.
It should be considered to represent AST-nodes directly as a Tokay built-in object.

The advantages are:

Memory-consumption will be lower, as an AST-node is represented by a native struct
ast() would serve as constructor
node.row, node.col, node.value and node.children could be considered as read-write-attributes
print() could be used to dump an ast-tree, which is currently done by ast_print()
The compiler's traversal code would become much easier to handle, because accessing elements is much easier

`(1,)+` is just silently accepted

The expression (1,)+ is currently parsed as a sequence of 1 and the pos-repetition operator.

main [start 1:1, end 1:6]
 sequence [start 1:1, end 1:6]
  op_mod_pos [start 1:1, end 1:6]
   sequence [start 1:1, end 1:5]
    value_integer [start 1:2, end 1:3] Str("1")

The correct interpretation would be:

(1,) shall be considered as a list with one item, therefore a list object is being generated
The pos-repetition operator is not valid in this case
Latter case may be heavy to resolve, but an error message should occur that no positive repetition on a list is allowed.

Implement builtin-tokens as character-classes

The latest commit 0ca7695 introduces builtin tokens by using a new token type Token::BuiltinChar and Token::BuiltinChars.

This might be fast, but also might lead in problems later, when merging character-classes like Whitespace + Uppercase should be made available. Therefore, implementing these built-in character-classes as real character-classes (Token::Char, Token::Chars) might be better.

The Rust standard library implements it here. Maybe it's worth a deeper view into this.

Provide an AST visualization method

For the user's manual of UniCC, I've used Batik, an XSLT-Script and a self-written tool chain to visualize parse trees from a grammar like this:

Add: Add '+' Mul | Mul
Mul: Mul '*' Integer | Integer

Add

and input 1+2*3+4 generating an SVG that can be turned into a PNG:

This example tree is generated from this call: ./mkast test.png "Add [ Add [ 1 Mul [ 2 3 ] ] 4 ]".

Such method can be implemented as part of the AST object (#32) and a method allowing to save the AST as an SVG.

Dict improvements

Dict currently is implemented as a BTreeMap.

This should be changed regarding #9 into the following behavior:

Use indexmap crate, keeping key insertion order (likewise in Python 3.5+) (done by a82a902
~~Allow for any RefValue as Dict key~~

Latter one is heavier, as a Hash must be implemented.

Redesign internal Tokay VM

The current implementation of the Tokay VM working on recursive structures with a bytecode-like flavor quickly overflows its stack when parsing, especially with block nesting.

The simplest overflow can be reproduced with eleven blocks on a REPL, so parsing {{{{{{{{{{{}}}}}}}}}}} overflows the stack immediately. In a multi-threaded environment like the test-cases, the program {{}} is enough to trigger the overflow. This is the reason for #5 which is closed now as it is not an issue with the test-cases.

Attached is a massif-profile for both the cargo test -- --nocapture kaputt and cargo run -- "{{{{{{{{{{{}}}}}}}}}}}" version (2,1 MB peak, 8,1 MB peak). This is a general architectural problem in the current implementation.

A better VM has to be created where stack usage is kept low all time, and most stuff is done on the heap.

Sequence and Rvalue syntax ambiguities

There are several conflicting ambiguities in the syntax of Sequence and Rvalue which result in undesired results.

list() (1, 2) is interpreted as call to list, who's result is called with 1 and 2. Desired result is a list() followed by a list 1, 2.
list().len(). is interpreted as list().len() and the "any character" Token identified as .. This syntax should generally prohibited, allowed should only be list().len() . with a delimiting whitespace in between.

Provide str_len(), dict_len(), list_len()

Also, something else that I was wondering about (for a small parser I thought of implementing with tokay), is a string len function, it doesn't look like there's a builtin for it, and I'm not quite sure how to implement it with tokay itself. this is of course a different issue, but I didn't want to start spamming new issues sweat_smile

Sure, and this was also possible with Tokay v0.4, but is currently broken in Token v0.5 as the entire built-in and object concept is being redesigned and modularized:

 $ tokay
Tokay 0.4.0
>>> "hello".len
5
>>>

I'll keep this in mind and will add a str_len()-builtin (which will than be callable like with tokay v0.4)

Originally posted by @phorward in #34 (comment)

Main parselet doesn't skip unicode input correctly

Works:

$ tokay . -- "Hello World"
("H", "e", "l", "l", "o", " ", "W", "o", "r", "l", "d")

Won't work:

$ tokay . -- "Hello Wörld"
thread 'main' panicked at 'byte index 1 is not a char boundary; it is inside 'ö' (bytes 0..2) of `örld`', ~/.cargo/registry/src/github.com-1ecc6299db9ec823/tokay-0.4.0/src/reader.rs:135:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

REPL creates a new main parselet for every prompt executed; Old mains stay until program end.

In the REPL, every line of code entered which is not a constant definition becomes the main-parselet for the next program run. Old main-parselets still remain in the constants pool, although they are not used anymore. This becomes visible when using TOKAY_DEBUG=1.

$ TOKAY_DEBUG=1 tokay
Tokay 0.4.0
>>> 1
main [start 1:1, end 1:2]
 sequence [start 1:1, end 1:2]
  value_integer [start 1:1, end 1:2] String("1")
0 => RefCell {
    value: Parselet(
        RefCell {
            value: Parselet {
                name: Some(
                    "__main__",
                ),
                consuming: None,
                severity: 5,
                signature: [],
                locals: 0,
                begin: [],
                end: [],
                body: [
                    Push1,
                ],
            },
        },
    ),
}
1
>>> 2
main [start 1:1, end 1:2]
 sequence [start 1:1, end 1:2]
  value_integer [start 1:1, end 1:2] String("2")
0 => RefCell {
    value: Parselet(
        RefCell {
            value: Parselet {
                name: Some(
                    "__main__",
                ),
                consuming: None,
                severity: 5,
                signature: [],
                locals: 0,
                begin: [],
                end: [],
                body: [
                    Push1,
                ],
            },
        },
    ),
}
1 => RefCell {
    value: Integer(
        2,
    ),
}
2 => RefCell {
    value: Parselet(
        RefCell {
            value: Parselet {
                name: Some(
                    "__main__",
                ),
                consuming: None,
                severity: 5,
                signature: [],
                locals: 0,
                begin: [],
                end: [],
                body: [
                    LoadStatic(
                        1,
                    ),
                ],
            },
        },
    ),
}
2
>>> 3
main [start 1:1, end 1:2]
 sequence [start 1:1, end 1:2]
  value_integer [start 1:1, end 1:2] String("3")
0 => RefCell {
    value: Parselet(
        RefCell {
            value: Parselet {
                name: Some(
                    "__main__",
                ),
                consuming: None,
                severity: 5,
                signature: [],
                locals: 0,
                begin: [],
                end: [],
                body: [
                    Push1,
                ],
            },
        },
    ),
}
1 => RefCell {
    value: Integer(
        2,
    ),
}
2 => RefCell {
    value: Parselet(
        RefCell {
            value: Parselet {
                name: Some(
                    "__main__",
                ),
                consuming: None,
                severity: 5,
                signature: [],
                locals: 0,
                begin: [],
                end: [],
                body: [
                    LoadStatic(
                        1,
                    ),
                ],
            },
        },
    ),
}
3 => RefCell {
    value: Integer(
        3,
    ),
}
4 => RefCell {
    value: Parselet(
        RefCell {
            value: Parselet {
                name: Some(
                    "__main__",
                ),
                consuming: None,
                severity: 5,
                signature: [],
                locals: 0,
                begin: [],
                end: [],
                body: [
                    LoadStatic(
                        3,
                    ),
                ],
            },
        },
    ),
}
3
>>>

Calling methods with self-only parameters directly

In Tokay v0.4, len was usable as an attribute of str, so both str.len() and str.len worked.
This is currently not possible anymore with Tokay v0.5, but should be considered.

Tokay 0.5.0
>>> s = "Hello World"
>>> s.len()
11
>>> s.len
<method builtin str_len of str object at 0x563e2b7594a8>
>>>

`Line` built-in

Line could be a built-in parselet accepting any line. The line-end should be depending on the used operating system (\r\n win,\r classic mac, \n unix/linux)

Supporting slices

The current main branch contains a version of Tokay which is already capable of doing this:

>>> s = "Hello World"
>>> s[2]
l

Now what if setting a character there is wanted?

>>> s[2] = "x"
thread 'main' panicked at 'not yet implemented', src/value/string.rs:56:13
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

This currently fails, because it is not fully implemented.

Surely, s[2] = "x" should turn "Hello World" into "Hexlo World", but what happens when calling

s[2] = "xxx"

In this case "Hexxxlo World" should be the result.

Now one can think about also saying something like this:

s[2..3] = "xx"  # "Hexxo World"
s[2..3] = "x" # "Hexo World"
s[2..3] = "xxxx" "Hexxxxo World"

The same can be thought for lists as well.

Python, for example, doesn't allow for all of this for strings, but it seems to work with lists.

Unit tests (cargo test) fail with stack overflow

Recognized now with 4bb68df (same as previously in #2).
There seems to be a stack overflow when cargo test is ran. The problem can either be avoided by calling RUST_MIN_STACK=8388608 cargo test or by calling cargo test -- --test-threads=1.

running 20 tests
test test::test_capture ... ok
test test::test_compiler_identifier_naming ... ok
test test::test_compiler_structure ... ok
test test::test_begin_end ... ok
test test::test_collections ... ok

thread 'test::test_examples' has overflowed its stack
fatal runtime error: stack overflow
error: test failed, to rerun pass '--lib'

Caused by:
  process didn't exit successfully: `tokay/target/debug/deps/tokay-d3f64d47b82cddfb` (signal: 6, SIGABRT: process abort signal)

Refering to this it seems that something is really overflowing the stack somehow, and the thread's stack limit is exhausted.

`for...in`-loops & iterators

To implement the for...in-loop construct in Tokay, it is required to make iterators available for values like dict, list or string.

Stack overflow when inline adding int and list

Current 555e25e produces this error

Tokay 0.5.0
>>> a = 1
>>> a += (2,3)
>>> a

thread 'main' has overflowed its stack
fatal runtime error: stack overflow
Aborted (core dumped)

This works:

>>> a = 1
>>> a = a + (2,3)
>>> a
(1, 2, 3)

wanted behavior is second, a should become a list (1, 2, 3).

Method call on Token object

Running tokay -- hello.

>>> [a-z]+
hello
>>> [a-z]+.upper()
Line 1, column 7: Method 'token_upper' not found
>>> [a-z]+ $1.upper()
HELLO
>>>

Assigned value to $0 should always have result precedence

The $0 value should always have a result-precedence in case it is assigned and no further push or higher keyword is being used.

This is the current behavior:

$ tokay "Word __ ''the'' __ Word \$0 = true" -- "Save    the   planet"
("Save", "the", "planet")

This is the wanted behavior:

$ tokay "Word __ ''the'' __ Word \$0 = true" -- "Save    the   planet"
true

3 < 10 is false? Value::Addr < Value::Integer PartialOrd mismatching

It seems there is a problem with a Value::Addr() and Value::Integer, as shown in the following example.

$ tokay
Tokay 0.4.0
>>> l = (1,2,3)
>>> l.len
3
>>> a = l.len
>>> a
3
>>> a < 10
false
>>> a > 10
true

This works:

>>> b = 3
>>> b < 10
true
>>> b > 10
false

Undefined parselet behavior with begin and end

With commit b6d81d5, the code

begin 1
2 3 4
end 5

produces (1, (2, 3, 4), 5). Anyway, this isn't the case when a parselet is not the main parselet

f : @{
    begin 1
    2 3 4
    end 5
}

f  # returns just "1"

(or short cargo run -- -d "f : @{ begin 1; 2 3 4; end 5 }; f")

This behavior generally has to be defined somehow, as it currently is a little bit undefined and confusing.

REPL main scope in compiler stays consumable even when the next prompt inserted unconsumable input

The main scope is not flagged as non-consuming when it was previously flagged consuming.

$ TOKAY_DEBUG=1 tokay
Tokay 0.4.0
>>> 'a'
main [start 1:1, end 1:4]
 sequence [start 1:1, end 1:4]
  value_token_touch [start 1:1, end 1:4] String("a")
0 => RefCell {
    value: Token(
        Touch(
            "a",
        ),
    ),
}
1 => RefCell {
    value: Parselet(
        RefCell {
            value: Parselet {
                name: Some(
                    "__main__",
                ),
                consuming: Some(
                    false,
                ),
                severity: 5,
                signature: [],
                locals: 0,
                begin: [],
                end: [],
                body: [
                    CallStatic(
                        0,
                    ),
                ],
            },
        },
    ),
}
>>> 1
main [start 1:1, end 1:2]
 sequence [start 1:1, end 1:2]
  value_integer [start 1:1, end 1:2] String("1")
0 => RefCell {
    value: Token(
        Touch(
            "a",
        ),
    ),
}
1 => RefCell {
    value: Parselet(
        RefCell {
            value: Parselet {
                name: Some(
                    "__main__",
                ),
                consuming: Some(
                    false,
                ),
                severity: 5,
                signature: [],
                locals: 0,
                begin: [],
                end: [],
                body: [
                    CallStatic(
                        0,
                    ),
                ],
            },
        },
    ),
}
2 => RefCell {
    value: Parselet(
        RefCell {
            value: Parselet {
                name: Some(
                    "__main__",
                ),
                consuming: Some(
                    false,
                ),
                severity: 5,
                signature: [],
                locals: 0,
                begin: [],
                end: [],
                body: [
                    Push1,
                ],
            },
        },
    ),
}
1

stdin should be default reader if no other stream is specified

Currently, stdin has to be explicitly set as input stream with tokay -- input.tok -- -, when no input file is specified. This behavior should be default.

At least, a wanted behavior could be just this invocation:

$ tokay 'print("Please enter your name:", newline=false) name => Line print("Hello " + $name)'
Please enter your name: John
Hello John

In current main, this already works:

 $ tokay 'print("Please enter your name:") name => Word print("Hello " + $name)' -- -
John 
Please enter your name:
Hello John
Please enter your name:
Please enter your name:
^C

Establishing `,` as list operator

In current Tokay v0.4, commas , are optional part of the syntax.

After some consideration, especially on #43, the idea came up that lists could generally by defined by separating commas, so the sequence 1 2 3 4 results in a list (1, 2, 3, 4), where 1 2, 3 4 actually could be interpretered as definitive list (2, 3) which is inside of a sequence, resulting in the list (1, (2, 3), 4).

This syntax is only allowed as SequenceItem or CollectionItem. Commas in parameter lists are still specified as they are, except when explicitly specified, so fn(1, (2,3), 4) provides 3 parameters.

$0 gets wrong start offset

Works:

$ tokay "Uppercase Lowercase+ \$0" -- "HelloWorld"
("Hello", "World")

Works not:

$ tokay "Uppercase Lowercase+ \$0" -- "aHelloWorld"
("Hello", "orld")
$ tokay "Uppercase Lowercase+ \$0" -- "aaHelloWorld"
("Hello", "rld")

Multiple variable assignment doesn't create self-contained values

This program

a = b = 10
a++
a b

outputs (11, 11) because a and b recieve the same value object. To fix this, all Store*Hold-operations need to clone the peeked TOS object first, but this seems to raise an unhandled problem in safe Rust or Rusts code optimization. The problem is currently stated here in detail, and an issue for Rust will soon be filed when the Tokay repository goes public.

`(1, )` should be parsed as a list, not as `(1)`

This issue relates to #35.

Tokay 0.5.0
>>> repr(list(1))
(1, )
>>> l = (1, )
>>> l
1
>>>

l should be (1, ) as it should be a list.

Rework value implementation

Tokay currently has a monolithic implementation of Values as an enum.

This enum is fine for the atomics and primitives (Void, Null, True, False, Integer, Float, Addr), but not for object-like or callable values.

Checklist (incomplete!)

This checklist is currently a sketch to define a goal for this issue. The concept as from Tokay 0.4, implementing object-related methods as prefixed built-ins (e.g str_upper() to allow for calling either str_upper("hello") or "hello".upper()) should remain and is a neat and viable feature.

Primitives

void and null stand on their own
bool(b) - Construct a bool
int(i) - Construct an int
float(f) - Construct a float
- t.trunc truncate fractional part
- t.fract return only fractional part
- t.ceil return next integer ceiling

String

List

list(v=void) - List constructor, also converts any non-list value into a list with one item
list_push(l, e, i=void) - Push e to end of l, or insert at index
list_pop(l, i=void) - Pop index i off l or remove from index
~~list_insert(l, i, e) - Insert e at index i in l~~
list_get_attr(s, a), e.g l.len
list_get_item(s, i), e.g. l[2], l[2..8]
list_set_item(s, i, v=void), e.g. l[3] = 42, l[3..5] = (7, 11, 9)

Dict

dict(l=void) - Dict constructor
dict_push(d, k, v) - Push v under k into d
dict_merge(d, o) merge dict o into d
dict_pop(d, k=void) - Pop key k off d (#75)
dict_get_item(d, k, d=void), e.g. d["x"]
dict_set_item(d, k, v=void), e.g. d["x"] = 1234

Callables

Token
Parselet
Builtin
Method

Resources

Unfortunately it is not obvious to implement these as trait objects, as Clone, PartialEq and PartialOrd is required, which break trait object safety.

Experimental playground on a possible solution
Also helpful resource

`ord('x')` causes stack overflow

Calling the ord()-function with a token parameter causes a stack overflow.

$ tokay
Tokay 0.4.0
>>> ord('x')

thread 'main' has overflowed its stack
fatal runtime error: stack overflow
Aborted (core dumped)

This does not happen e.g. when calling chr('a').

Tabs not supported as whitespace

tab.tok

123 {
    456
}

fails with Line 2, column 1: Expecting "}" but

123 {
    456
}

works. First one uses a tab instead of four spaces.

Support for regular expression-based Tokens

After a longer consideration, tokens based on regular expressions should not be implemented currently, but maybe later.

The reason for this is, that Tokay implements the packrat-based parsing strategy on its own already, and adding regular expressions (which must operate on streams rather than strings) may break the current strategy in several situations.

Anyway, for further discussion and for collecting resources and ideas, this issue was opened and is recently updated and a place for discussions.

Existing Regex implementations

Some further information on existing regular expression implementations and their problems and benefits.

Rust's regex on text streams - currently not possible, same as rust-onig and fancy-regex
Hyperscan - Implements streaming regular expressions, could be an option
libphorward regex module - Currently written in C, Rust bindings could be implemented here to allow for a push-scanning method (see plexctx_-functions)

Regex conversion in Tokay

Tokay might be able to convert a Token expressed as a Regex into a Tokay Op-construct on its own. This would allow to express Tokay parsing constructs as Regex but use Tokay's parsing facilities directly.

Useful links and resources

https://swtch.com/~rsc/regexp/ - Regular Expressions explained by Russ Cox

tokay-lang / tokay Goto Github PK

tokay's Introduction

Tokay

About

Highlights

Installation

Examples

Documentation

Logo

License

tokay's People

Contributors

Stargazers

Watchers

Forkers

tokay's Issues

Draft

Definition

Usage

Builtin generics

Until : @<P, Escape: Void>

Repeat : @<P> min=1, max=0

Not : @<P>

Peek : @<P>

Expect : @<P> msg=void

List : @<P, Separator: ',', empty: true>

Keyword : @<P>

User-defined generics

Checklist (incomplete!)

Primitives

String

List

Dict

Callables

Resources

Existing Regex implementations

Regex conversion in Tokay

Useful links and resources

Recommend Projects

Recommend Topics

Recommend Org

`Until : @<P, Escape: Void>`

`Repeat : @<P> min=1, max=0`

`Not : @<P>`

`Peek : @<P>`

`Expect : @<P> msg=void`

`List : @<P, Separator: ',', empty: true>`

`Keyword : @<P>`