
Multi-Language Platform for Dynamic Programming Languages

Home Page: https://pointersgonewild.com/category/zeta/

License: BSD 3-Clause "New" or "Revised" License

PLSQL 35.92% M4 0.19% Shell 1.31% C++ 51.40% Python 9.34% JavaScript 1.81% Dockerfile 0.03%

vm interpreter jit jit-compiler language

zetavm's Introduction

ZetaVM

Please note that ZetaVM is currently at the early prototype stage. As such, it is incomplete and breaking changes may happen often.

Requirements:

GNU Make
GCC 5.4+ (Linux) or clang (OSX), or cygwin (Windows)
Optional: autoconf and pkg-config, if needing to edit the configure file
Optional: sdl2, if wanting to use audio and graphics capabilities
Optional: Python 2 is needed to run the benchmark.py script

Installation

# Clone this repository
$ git clone git@github.com:maximecb/zetavm.git

# Run the configure script and compile zetavm
# Note: run configure with `--with-sdl2` to build audio and graphics support
$ cd zetavm
$ ./configure
$ make -j4

# Optionally run tests to check that everything works properly
$ make test

Basic Usage

# To run programs, pass the path to a source file to zeta, for example:
$ ./zeta benchmarks/fib.pls -- 29

# To start up the Plush REPL (interactive shell),
# you can run the Plush language package as a program:
$ ./zeta lang/plush/0

About ZetaVM

ZetaVM is a virtual machine and JIT compiler for dynamic programming languages. It implements a basic core runtime environment on top of which dynamic programming languages can be implemented with relatively little effort.

Features of the VM include:

Built-in support for dynamic typing
Garbage collection
JIT compilation
Dynamically growable objects (JS-like)
Dynamically-typed arrays (JS/Python-like)
Integer and floating-point arithmetic
Immutable UTF-8 strings
Text-based code and data storage format (JSON-like)
First-class stack-based bytecode (code is data)
Built-in graphical and audio libraries
Coming soon: built-in package manager

Zeta image files (.zim) are JSON-like, human-readable text files containing objects, data and bytecodes to be executed by ZetaVM. They are intended to serve as a compilation target, and may contain executable programs, or libraries/packages.

More Information

A recording of a talk about ZetaVM given at PolyConf 2017 is available.

For more information, see the documentation in the docs directory.

There are also a few blog posts about Zeta and its design.

For additional questions and clarifications, open a GitHub issue and tag it as a question, or join the ZetaVM Gitter chat.

zetavm's Issues

Audio output API (SDL2)

I would like to add SDL2 audio support to Zeta. This would be exposed through a core API that does not expose any of the details of SDL. A new "core/audio" package would be created. See the vm/core.cpp source file.

I would prefer that the API be polling-based rather than callback-based (SDL2 does provide an API for this). You would simply pass an array of samples to be written, and you could call a function to know the current buffer state (how many samples haven't been played back yet).

Code sample for SDL2 audio playback with polling/queueing:
https://skia.googlesource.com/third_party/sdl/+/refs/heads/master/test/loopwavequeue.c

SDL wiki reference:
https://wiki.libsdl.org/SDL_QueueAudio

What I have in mind is an API that allows you to output either mono or stereo at a fixed 44100Hz sample rate. The goal is really to make audio playback as simple and newbie-proof as possible. The samples would be provided as an array of float32 samples that would be interleaved, ie:

var audioData = [chan 0 sample 0, chan 1 sample 0, chan 0 sample 1, chan 1 sample 1, ...]

var audio = import "core/audio"
var dev = audio.open_output_device(num_channels);
audio.queue_samples(dev, audioData);

// Returns the number of queued samples not yet sent to the hardware
var numSamples = audio.get_queue_size(dev);
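
To make the interleaved layout concrete, here is a minimal Python sketch (not Plush, and not part of any existing Zeta API) of how per-channel buffers could be flattened into the array format described above. The interleave helper and its argument shape are assumptions made purely for illustration.

```python
def interleave(channels):
    """Interleave per-channel sample lists into one flat list.

    channels: list of equal-length lists, one list per channel.
    Hypothetical helper, not part of the proposed core/audio package.
    """
    if not channels:
        return []
    length = len(channels[0])
    if any(len(ch) != length for ch in channels):
        raise ValueError("all channels must have the same number of samples")
    # zip groups sample i of every channel together, in channel order
    return [sample for frame in zip(*channels) for sample in frame]


left = [0.0, 0.1, 0.2]
right = [1.0, 1.1, 1.2]
audio_data = interleave([left, right])
# audio_data is [0.0, 1.0, 0.1, 1.1, 0.2, 1.2]
```

With a single channel (mono), the output is simply the input samples in order.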

Recursive string.join function

Currently, the join function in the std/string library uses successive string concatenations to assemble the output string. This is problematic because the total running time is O(n^2). I think we could get a much better running time by implementing a divide and conquer algorithm instead, which splits the left and right string arrays in half and performs the join operation recursively. A use case for this is when concatenating strings to output a text file, which does currently have very poor performance.
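
A rough sketch of the divide and conquer idea, written in Python rather than Plush for illustration. The function name and signature are assumptions, not the actual std/string API. If concatenation costs time proportional to the length of the result, each level of the recursion copies every character once, and there are about log2(n) levels, so the total work is roughly O(L log n) for L output characters.

```python
def join(strings, sep=""):
    """Join an array of strings with a separator, recursively.

    Sketch only: mirrors the proposed algorithm, not the std/string code.
    """
    def join_range(lo, hi):
        # Joins strings[lo:hi]; hi is exclusive
        if hi - lo == 0:
            return ""
        if hi - lo == 1:
            return strings[lo]
        mid = (lo + hi) // 2
        # Each half is assembled independently, then concatenated once
        return join_range(lo, mid) + sep + join_range(mid, hi)

    return join_range(0, len(strings))


line = join(["a", "b", "c"], ", ")
# line is "a, b, c"
```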

Subprocess support for scripting

In order to be useful for minimal shell/scripting-type tasks zeta should provide the ability to create & control subprocesses.

A reasonable first implementation would:

have a synchronous API
be implemented with or based on the popen function: http://man7.org/linux/man-pages/man3/popen.3.html
have at least one function which accepts a command to run with stdin and/or stdout being relegated to the parent (zeta) process
have at least one function which accepts a command to run with stdin and/or stdout being provided as arguments (pointers to strings, buffers, etc)

Note that this API has a lot of potential for yak-shaving as the low level will be pretty awkward to use. The bare minimum to allow some scripting should be completed first, later when the standard lib is more fleshed out with allocation/concurrency/string functions more convenient wrappers can be created.

Higher level functions may include:

a function which accepts a command to run and data to pipe into its stdin
a function which accepts a command to run and a callback function which will receive the output of the command
etc
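
As a point of comparison, here is what a minimal synchronous API of this shape could look like, sketched in Python with its standard subprocess module standing in for popen. The function names run and run_with_input are hypothetical and only illustrate the two variants listed above (inherited streams versus streams passed as arguments).

```python
import subprocess
import sys


def run(command):
    """Run a command with stdin/stdout inherited from the parent process.

    Blocks until the command finishes and returns its exit code.
    """
    return subprocess.run(command).returncode


def run_with_input(command, input_text=""):
    """Run a command, feeding input_text to its stdin and capturing stdout.

    Returns a (exit_code, output_text) pair.
    """
    result = subprocess.run(
        command,
        input=input_text,
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    return result.returncode, result.stdout


# Example: a child process that upper-cases whatever it reads on stdin
child = [sys.executable, "-c",
         "import sys; sys.stdout.write(sys.stdin.read().upper())"]
code, out = run_with_input(child, "hello")
```

The callback-based variant mentioned above could be layered on top of run_with_input by passing the captured output to the callback.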

Creating the configure executable doesn't work

Command to create the configure executable:

autoconf configure.ac > configure

The error I get when running it:

./configure --with-sdl2
checking for g++... g++
checking whether the C++ compiler works... yes
checking for C++ compiler default output file name... a.out
checking for suffix of executables... 
checking whether we are cross compiling... no
checking for suffix of object files... o
checking whether we are using the GNU C++ compiler... yes
checking whether g++ accepts -g... yes
./configure: line 2282: syntax error near unexpected token `SDL,'
./configure: line 2282: `    PKG_CHECK_MODULES(SDL, sdl2 >= 2.0.0, CXXFLAGS="${CXXFLAGS} -DHAVE_SDL2")'

Plush command-line option parser

ZetaVM has an option parser, written in C++, which was contributed by ashwanidausodia. This option parser handles the command-line options of the VM itself.

However, programs running on ZetaVM can receive their own options, eg:

# Options after the double dash are passed to the program being run by ZetaVM
./zeta myprogram -- opt1 ... optN

Currently, the options passed to programs running on ZetaVM are handled in an ad-hoc manner. See this program for an example.

It would be desirable for us to provide an std/options library to parse command-line arguments. This library should be written in Plush. We may want to try and have an API similar to the C++ option parser in ZetaVM, for consistency, though this isn't strictly necessary.

Note: there are already a number of useful functions in the std/string and std/array libraries which may be useful in implementing the option parser.

Floating-point parsing support in the std/parsing library

A little while back, I spun off some of the lower-level parsing functions in the Plush parser into their own parsing library, so that other languages could reuse them. These functions basically do lexing, parsing of identifiers and string literals, things that are not overly specific to the Plush language.

One feature that is missing from the parsing library is the ability to parse floating-point numbers. I wrote a parseInt function, because that's relatively straightforward, but parsing numbers in general is trickier. The main reason why it's trickier is that all numbers start with a digit, and you don't know, when you begin parsing the number, if you are parsing a float or not.

The current Plush implementation solves this problem by accumulating the digits in a string. This, however, is kind of bad, because it causes a series of successive string concatenations. I would like to have an implementation that avoids this.

IMO, the best strategy is probably to note the current index in the input string being parsed when starting to parse a number, and scan until either a period is hit or the end of the digits is found. Then get the chars as a substring, and delegate the actual parsing to the $str_to_i32 or $str_to_f32 instructions. Possibly, integer parsing could be done directly, to completely avoid the allocation caused by the substring call.

There are other details to take into consideration, such as the fact that some values might be too large to be parsed as valid integers. This should result in an exception being thrown (see parseError function in the parsing library).

Lastly, I would like, for now, to force floats to end with the f char. For example, 3.2f. This is because we currently only support 32-bit floats, but may eventually support doubles.
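
The scanning strategy described above can be sketched as follows, in Python for illustration. This is not the std/parsing code: the parse_number name, the (value, new_index) return shape, and the requirement of at least one digit after the period are assumptions. Python's int() and float() stand in for the $str_to_i32 and $str_to_f32 instructions.

```python
INT32_MAX = 2 ** 31 - 1


def is_digit(ch):
    return "0" <= ch <= "9"


def parse_number(src, pos):
    """Parse an integer or a float literal ending in 'f', starting at src[pos].

    Returns (value, new_pos). Raises ValueError on malformed input.
    """
    start = pos
    # Scan the integer part without building a string char by char
    while pos < len(src) and is_digit(src[pos]):
        pos += 1
    if pos == start:
        raise ValueError("expected a digit at index %d" % start)

    if pos < len(src) and src[pos] == ".":
        pos += 1
        frac_start = pos
        while pos < len(src) and is_digit(src[pos]):
            pos += 1
        if pos == frac_start:
            raise ValueError("expected digits after '.'")
        if pos >= len(src) or src[pos] != "f":
            raise ValueError("float literals must end with 'f'")
        # A single substring is taken, then handed to the float converter
        return float(src[start:pos]), pos + 1

    value = int(src[start:pos])
    if value > INT32_MAX:
        raise ValueError("integer literal does not fit in 32 bits")
    return value, pos


value, end = parse_number("3.2f", 0)
# value is 3.2 and end is 4
```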

Improving User-Friendliness

I think there are probably many small things we can do to make Zeta more user-friendly. You may be getting cryptic error messages, running into strange bugs, segmentation faults, or odd behaviors.
This is the place to report these kinds of problems. Small issues like these often go unreported, and may discourage people from using Zeta. By telling us about them, you can help us fix them, add new tests, and ensure that other users don't run into the same problem.

String library (std/string)

The core VM doesn't provide very much in the way of string functionality. Just support for immutable UTF-8 strings with string interning. It seems it might be beneficial to write a standard library that has a collection of useful string functions. This is particularly relevant since ZetaVM is a tool for language implementation, and we necessarily do a fair bit of string manipulation as part of parsing.

This library should be written in Plush, and the package should go under std/string/0.

See the way the math package is compiled for reference:

zetavm/makefile.in

Line 73 in 9b944ed

math-pkg: $(CPLUSH_BIN) plush/parser.pls

Things we should probably include

toString: function to convert values to a string
indexOf: finds the first index of some substring
substr: get a substring
split: as in Python, split string based on some delimiter, produces an array
toLower, toUpper: convert to upper or lowercase
strip: remove whitespace, can also be subdivided into lstrip and rstrip
replace: replace all occurrences of some substring

These don't have to be the only functions we include, suggestions welcome. Also, if you're interested in working on this library, note that you don't have to implement all the functions at once; an implementation of just a subset of these would already be valuable.

Plush floating-point number format

I was writing some code the other day and it occurred to me that we should really force floating-point numbers in plush to have the same trailing "f" as in zeta image files.

eg: 3.56f, instead of just 3.56

The reason is that we may eventually add float64 support, and if we do that, they will need to have a different literal format. Hence, I would rather we follow the same convention as C and have float32 values have the trailing f, and float64/double values not have it.

Note: we should add parser tests for this. It's now possible to use testParseFail in plush/tests/parser.pls to check that certain strings should not parse, so we can verify that 3.56 does not parse.

std/time needs a get_local_time function

The std/time library could use a function to get the local time. One potential use for this, besides knowing the time, is as a random seed for the random number generator.

The std/time library is implemented in vm/packages.cpp. I think we should just use the ctime functions to implement this.

The get_local_time function should return an object with fields based on the C tm struct. However, I would rename the fields to make them more intuitive, eg: year, month, day, week_day, year_day, is_dst. I also propose that we do not expose the raw tm_year value (years since 1900), because it is unintuitive; the year field should hold the full calendar year instead.
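
For illustration, the shape of the returned object could look like the Python sketch below. This is only a mock-up of the field naming, not the vm/packages.cpp implementation. The hour, minute and second fields are assumptions (they are not listed above), and the value ranges are Python's (month is 1 to 12, week_day counts from Monday as 0), which may differ from what Zeta ends up choosing.

```python
import time


def get_local_time():
    """Return the local time as a dictionary with readable field names."""
    t = time.localtime()
    return {
        "year": t.tm_year,       # full calendar year, not years since 1900
        "month": t.tm_mon,       # 1..12 in Python (C's tm_mon is 0..11)
        "day": t.tm_mday,        # day of the month
        "hour": t.tm_hour,       # assumed field
        "minute": t.tm_min,      # assumed field
        "second": t.tm_sec,      # assumed field
        "week_day": t.tm_wday,   # 0 is Monday in Python (C uses 0 for Sunday)
        "year_day": t.tm_yday,   # 1..366 in Python (C's tm_yday is 0..365)
        "is_dst": t.tm_isdst > 0,
    }


now = get_local_time()
```

For use as a random seed, a combination such as year_day, hour, minute and second would vary between runs.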

Trying to implement a language - Marco

I'm opening this issue to have a place where I can ask questions that could also be relevant to people in future trying to do similar things to me while implementing a language with zetaVM.
For other less relevant issues I may contact @maximecb in private (I sent you a request on Hangouts) in order to reduce the spam of silly questions here.

Yesterday I was trying to come up with some ideas about the language I want to implement. While doing so, I stumbled upon one problem that I can't solve in my head.

The problem: I want the ability to get all the fields in an object, so that I can iterate over them or create a new deep copy of the object. I know that this is possible inside the VM, but as far as I know there is no instruction that gives me this ability. Would it be possible to implement this with the instructions that already exist or should we add a new instruction to zeta for this?

Float Constant PI

I was thinking that maybe having pi as a float constant could simplify the creation of languages on top of zeta. Am I wrong?

Map library (std/map)

There is a need for a hash map library, particularly given that the core VM doesn't support the use case of objects being used as dictionaries. I think that this can be made fairly efficient once we have a JIT compiler. It's something I would prefer we implement in userland rather than in the VM itself. There are logical reasons to do this, the main one being that hash maps require hash and equality functions. Hence, if the implementation is in the VM itself, there is a need to call into user functions to use custom hash functions, which would be inefficient.

This library should be written in Plush, and the package should go under std/map/0.

I've written an implementation of a hash map in JavaScript as part of the Higgs compiler project. This can serve as a template/inspiration for how to structure the code for this: https://github.com/higgsjs/Higgs/blob/master/source/stdlib/map.js

Sidenote: Plush allows defining prototypal inheritance with a syntax of the form ProtoObj::{ properties}. This is exemplified in the Parsing library:

zetavm/plush/parsing.pls

Line 498 in 7c7459e

var input = Input::{

Suggestions wrt how the map API should be designed are welcome.
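
To feed the API discussion, here is one possible shape for a userland hash map that takes custom hash and equality functions, sketched in Python. The class name, method names (set, get, has), bucket count and growth policy are all assumptions for illustration; this is neither the Higgs implementation nor a proposed final design.

```python
class HashMap:
    """Chained hash map parameterized by user-supplied hash/equality functions."""

    def __init__(self, hash_fn=hash, eq_fn=lambda a, b: a == b, num_buckets=8):
        self.hash_fn = hash_fn
        self.eq_fn = eq_fn
        self.buckets = [[] for _ in range(num_buckets)]
        self.length = 0

    def _bucket(self, key):
        return self.buckets[self.hash_fn(key) % len(self.buckets)]

    def set(self, key, value):
        bucket = self._bucket(key)
        for entry in bucket:
            if self.eq_fn(entry[0], key):
                entry[1] = value  # overwrite an existing key
                return
        bucket.append([key, value])
        self.length += 1
        # Double the bucket array once the average chain gets long
        if self.length > 2 * len(self.buckets):
            self._grow()

    def has(self, key):
        return any(self.eq_fn(k, key) for k, _ in self._bucket(key))

    def get(self, key, default=None):
        for k, v in self._bucket(key):
            if self.eq_fn(k, key):
                return v
        return default

    def _grow(self):
        entries = [e for bucket in self.buckets for e in bucket]
        self.buckets = [[] for _ in range(2 * len(self.buckets))]
        for entry in entries:
            self._bucket(entry[0]).append(entry)


# Example: a map whose keys are compared case-insensitively
names = HashMap(hash_fn=lambda s: hash(s.lower()),
                eq_fn=lambda a, b: a.lower() == b.lower())
names.set("Foo", 1)
```

Because the hash and equality functions are ordinary user functions, the same map code can serve languages with different notions of key equality.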

Timing functions, time library (std/time)

In order to support animations, games, or benchmarking, we're going to need some timer/timing functions.

Right now, we're limited to 32-bit floats and 32-bit integers, so we can't really do "time since unix epoch", but we could expose a timer function that produces the time since the process started. Because of the limited precision of 32-bit floats, it would probably make the most sense to return this quantity in terms of milliseconds. Future extensions of Zeta, once we have int64 or float64 types, can add more sophisticated timing functions.
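
To put a number on the precision limit: a 32-bit float has a 24-bit significand, so whole milliseconds are represented exactly only up to 2^24 ms (16,777,216 ms, about 4.66 hours); past that point the timer can no longer distinguish consecutive milliseconds. The Python sketch below shows a process-relative timer and the rounding effect. The function names are hypothetical, and struct is used only to emulate float32 rounding.

```python
import struct
import time

_START = time.monotonic()  # taken when the module is loaded


def to_float32(x):
    """Round a 64-bit Python float to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", x))[0]


def get_time_millis():
    """Milliseconds elapsed since startup, as a float32-precision value."""
    return to_float32((time.monotonic() - _START) * 1000.0)


limit = 2 ** 24                            # 16,777,216 ms
exact = to_float32(float(limit))           # still exact
rounded = to_float32(float(limit + 1))     # rounds back down to 2**24
hours = limit / 1000.0 / 3600.0            # roughly 4.66
```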

AFAIK, the C time functions have terrible precision on Windows (coarser than a millisecond), but that could be acceptable for now. Another alternative would be to use performance counters on Windows.

Suggestions/feedback on this welcome. Suggestions for other useful time functions also welcome.

Random number generator library

It would be useful to have a random number generator library. Something we could use as follows:

var random = import 'std/random';
var rng = random.newRNG(seedVal);

// Generate an integer in [a,b]
var intVal = rng.int(a,b);

// Generate a float in [a,b]
var floatVal = rng.float(a,b);

// Generate a random index in [0, len[
var idx = rng.index(len);

Would also be useful to have a method to select a random element in an array, and to generate random booleans, etc.

I've written this C++ code a while back, which could be ported or serve as a base for the implementation: https://gist.github.com/maximecb/617de45a99347a9911b1e0d974da5d62
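
For reference, the proposed API maps fairly directly onto a small wrapper class. The Python sketch below wraps the standard random.Random generator purely to illustrate the interface; it is not a port of the C++ gist, and the elem and bool methods are hypothetical names for the extra helpers mentioned above.

```python
import random


class RNG:
    """Seeded random number generator exposing the proposed methods."""

    def __init__(self, seed):
        self._rand = random.Random(seed)

    def int(self, a, b):
        """Random integer in [a, b], both ends included."""
        return self._rand.randint(a, b)

    def float(self, a, b):
        """Random float in [a, b]."""
        return self._rand.uniform(a, b)

    def index(self, length):
        """Random index in [0, length[ (length excluded)."""
        return self._rand.randrange(length)

    def elem(self, array):
        """Random element of a non-empty array (hypothetical helper)."""
        return array[self.index(len(array))]

    def bool(self):
        """Random boolean (hypothetical helper)."""
        return self._rand.random() < 0.5


def new_rng(seed):
    return RNG(seed)


rng = new_rng(1337)
int_val = rng.int(1, 6)
idx = rng.index(10)
```

Two generators created with the same seed produce the same sequence, which is useful for reproducible tests.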

Float support for math.pow, missing exp and log instructions

I was writing some audio synthesis code, and porting over some filter code I had written a while back. Unfortunately, the said filter relies on the math.pow(x,y) function accepting floating-point values, and right now we only support integers. I would like to fix that gap.

I think a good solution, for now, may be to expose the C log(x) and exp(x) functions through new log_f32 and exp_f32 instructions, and use those to implement support in math.pow(x,y), at least for positive floating-point values. We can throw an exception for negative floating-point exponents or bases. If you're willing, you can try to support the whole range of possible values, but some care has to be taken in determining which values should throw exceptions, and which ones should produce NaN as output.
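
The underlying identity is x^y = exp(y * log(x)), which holds for any positive base x. Here is a minimal Python sketch of that approach, with math.log and math.exp standing in for the proposed instructions. The function name pow_f and the handling of a zero base are assumptions. Note that this sketch rejects negative bases only; it accepts negative exponents, since the identity still holds for them, which is a deviation from the more conservative rule suggested above.

```python
import math


def pow_f(x, y):
    """Compute x ** y for floats using exp and log."""
    if x < 0.0:
        raise ValueError("negative bases are not supported")
    if x == 0.0:
        # log(0) is undefined, so a zero base is handled separately
        if y > 0.0:
            return 0.0
        if y == 0.0:
            return 1.0
        raise ValueError("zero cannot be raised to a negative power")
    return math.exp(y * math.log(x))


root_two = pow_f(2.0, 0.5)   # approximately 1.41421356
```

Results are approximate: pow_f(2.0, 10.0) is very close to 1024.0 but not guaranteed to be bit-identical, so integer arguments are probably best kept on the existing integer path.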

To implement the new instructions, you should look in vm/interp.cpp. The math library code is in plush/math.pls. Tests should go into tests/plush/math.pls.

Comments and feedback welcome. Note that this is needed relatively soon (so I can continue writing my code! :D).

SDL Create Windows is not working (in plush)

I was trying to run draw_font.pls in the benchmarks folder, but it doesn't seem to work.

EDIT:
It seems like the create_window function in the C++ code gets passed the wrong type of values.
I'm not sure what's going on.

Consider changing to zlib license for less problematic inclusion into binaries

This project appears to be a bytecode VM aimed at any sort of programming language developer seeking a backend. Now before you shoot me, yes the current license is already very liberal and generous. However, there is one point that is a bit problematic:

From my impression (disclaimer: I have just discovered this project) the regular use case appears to be 1. I write a compiler targeting the VM, 2. I write some sort of tool which combines the VM and the bytecode into a single binary. (since unless ZetaVM and/or my own language as a language developer becomes widely popular, this is the only way to make a self-contained binary that is likely to run easily on a wide choice of target systems)

Now given this is the default deployment, this means for compiling a single binary of a hello world, the end user (the programmer using the language to make that hello world program) would already need to include an attribution line into the hello world program for the ZetaVM compiled into it. The reality is that most people wouldn't expect such a requirement purely from using the standard library & standard compiler, and most likely also won't do it in practice and hence violate the terms without being aware of it.

Now of course you could ignore that and most likely nobody is going to get sued over it anyway, but for a component that is likely to be included into the final binary even for the most basic programs, a license like the zlib license (which requires attribution in source code and to not lie about who created the software, but drops the attribution requirement in any sort of binary) might be more suitable.

Feel free to disagree and/or ignore this since attribution is obviously your good right as a creator, I just wanted to bring attention to this potential issue with how it might end up being used in practice.

Spinning out the Plush runtime as a standalone library (std/runtime) ?

This issue is to discuss a potential idea.

One of the things I'm working towards is fully bootstrapping the Plush implementation. There are currently two implementations of the language. The cplush one, which is written in C++, and the self-hosted Plush language package (in plush/plush_pkg.pls).

I've been slowly working on giving the VM the ability to serialize code into ZIM files, which would enable us to write compiled code into ZIM files that don't need to be parsed in order to run on Zeta. At this point, the Plush package is already able to parse itself in-memory. It's actually surprisingly fast, only taking about 1.4 seconds on my laptop.

I did run into a snag though, which is that currently, cplush bakes the Plush runtime (plush/runtime.pls) into compiled files. The Plush package then makes use of the baked functions directly. This doesn't really work when the package parses itself. What would make more sense, it seems, is to have the Plush runtime be its own package.

This brings me to a question which I would like your opinion about: where should I put the Plush runtime package? One possibility is that I could directly put it into std/runtime. This runtime library could become not just the Plush runtime, but a collection of useful runtime functions which multiple languages can use.

However, there is a risk that no matter what, the library will always remain very Plush-specific. Hence maybe it shouldn't be named std/runtime. Possibly, this should be a sub-package of lang/plush, but we currently do not really have support for these. It's not something I've given much thought to. Possibly, we can simply allow packages to use the current versioning scheme, but to live within the path of other packages. As such, there could be a lang/plush/runtime/0. This would be versioned independently of lang/plush/0.

Trying to think about the future development of our ecosystem, having sub-packages may be inevitable. If we think about having a Lua implementation, for instance, we will need to implement the Lua standard library. This will have to live within one or more packages as well. It might make more sense for such packages to live within lang/lua/stdlib/* rather than under lang/lua-stdlib.

Discussion: Add the 'char' primitive type

Because there is no char primitive type, it is not possible to efficiently implement characters (without wrapping them into objects) for a language targeting ZetaVM.
This is suitable for languages like JavaScript, but not for languages using two different types for strings and chars (e.g. Scheme).

If a char primitive type is added:

The get_char instruction should return a char instead of a string.
To keep the IR concise, the implementation can convert it to a string.
Another option is to keep two instructions (e.g. get_char and get_char_str) to simplify language implementations.
Char comparison needs to be added.
A single instruction to convert from a char to an int64 would allow the implementation to do char comparison.
Another option is to add explicit IR instructions to do char comparison (e.g. lt_char, le_char, ...)

Travis being slow.

Today, just to start building 2 PRs ( #49 #50 ) travis took almost 3 hours, and this is not the first problem it gives us (when we moved repo it stopped building). Ideally we would like to have our CI to be as stable and fast as possible.
Should we try CircleCI? I could try to set it up in my repo and then for a first period we could have both CIs testing our code to see which one is the best one.
In the meanwhile I could also integrate AppVeyor (https://www.appveyor.com/) that will allow us to build and test on Windows!

Clarify mapping of stack arguments to locals.

At the moment, it's rather unclear, even when reading the source code, which local variable corresponds to which function argument. For example, is the closure argument the first or last argument? If my params field is ["x", "y"], which index is assigned to each?

I'd understand if this is subject to change, but the lack of documentation is a minor stopping point for people trying to write their own languages.

Compilation errors on Cygwin GCC 5.4.0

The following code in runtime.cpp:
/// Undefined value constant
const Value Value::UNDEF(Word(0LL), TAG_UNDEF);

/// Boolean constants
const Value Value::FALSE(Word(0LL), TAG_BOOL);
const Value Value::TRUE(Word(1LL), TAG_BOOL);

/// Numerical constants
const Value Value::ZERO(Word(0LL), TAG_INT32);
const Value Value::ONE(Word(1LL), TAG_INT32);
const Value Value::TWO(Word(2LL), TAG_INT32);

Fails compilation with error:
call of overloaded ‘Word(long long int)’ is ambiguous

Seems the compiler can't make the link between int64_t and long long.
When I add casts to int64_t it works. Perhaps adding ULL instead of LL works too.

Printing arrays

I attempted to print an array in plush and I get the following error message:

aborting execution due to error: unhandled type in equality comparison: array

I assume printing arrays isn't implemented in the print function?
An example of this is:

var someArray = [1,2,3,5,7];
print(someArray); <--- fails
print(someArray[0]); <--- works fine.

Missing std/string and std/math functions

I'm working on making the benchmarks under /benchmarks parameterizable with some size argument on the command-line, so that we can quickly run all of the benchmarks as part of make test. In the process of doing this, I noticed that at least two useful string and math functions are missing.

We're missing a math.pow(x, y). At the very least, we should support this for positive integer bases and exponents to begin with. Ideally, we would support floating-point values too, but this might require adding log and exp instructions. Advice welcome.

In the string library, we could use some left-padding and right-padding functions, to pad strings up to some given width. I would just call these string.lpad and string.rpad.

We currently have a string.parseInt function, but we're missing string.parseFloat. The simplest solution would be to use the $str_to_f32 instruction to do the parsing. The format of the input needs to be validated first. That can be done with a state machine.

math.pow(x, y)
string.lpad(str, char)
string.rpad(str, char)
string.parseFloat(str)

Note: if you're interested in contributing, I will accept a PR for any of these functions. It's not required that you implement all of them.
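
As an illustration of the state machine idea for string.parseFloat, here is a small Python sketch. The accepted format (optional minus sign, digits, optional fraction with at least one digit, no exponent) is an assumption, since the exact format is not specified above, and Python's float() stands in for the conversion instruction.

```python
def is_valid_float(text):
    """Check text against: ['-'] digit+ ['.' digit+] using a state machine."""
    state = "start"
    for ch in text:
        digit = "0" <= ch <= "9"
        if state == "start":
            if ch == "-":
                state = "sign"
            elif digit:
                state = "int"
            else:
                return False
        elif state == "sign":
            if not digit:
                return False
            state = "int"
        elif state == "int":
            if ch == ".":
                state = "dot"
            elif not digit:
                return False
        elif state == "dot":
            if not digit:
                return False
            state = "frac"
        elif state == "frac":
            if not digit:
                return False
    # Only these two states correspond to a complete number
    return state in ("int", "frac")


def parse_float(text):
    """Validate the input, then delegate the conversion."""
    if not is_valid_float(text):
        raise ValueError("invalid float string: " + repr(text))
    return float(text)


half = parse_float("-0.5")
```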

Implementation of a JavaScript front-end

It would be great to have support for JavaScript, or a subset of it. This would allow us to use existing JS code. The Plush language is quite similar to JS, so the Plush parser could be used as a base to begin implementing JS support.

Image Serialization

ZetaVM uses a text-based file format (ZIM files) to store code and data. This file format resembles JSON, with the important difference that it can represent arbitrary graphs of objects. Zeta needs to have the capability not only to read graphs of objects from ZIM files, but also to reverse this process and serialize graphs of objects into strings or new ZIM files.

The ability to serialize objects into ZIM files will make it easy for languages which run on ZetaVM and generate bytecode in memory to save the generated code into a compiled file. It may also make it possible for programs to suspend their own state to disk and resume execution later on (with some limitations). Lastly, being able to read and write ZIM files opens up the possibility of using this format to easily save data, and to use it to transmit data over the internet.

The code used to parse ZIM files is found in vm/parser.cpp. Code for serialization could be written under vm/serialize.cpp. I propose that we implement a function that starts from a root object or value, traverses the graph of reachable values, and outputs a string representing the serialized image in ZIM format. Serializing to a string instead of directly writing to a file has the advantage that we could use this for network transmission of data as well.

One potential issue is that there are values which cannot be serialized, namely host functions (hostfn tag) and values tagged as raw pointers (rawptr). These can be things like open file handles or output device handles (eg: audio interface). For the moment, I propose that we simply throw an exception if such values are reached. This would make it up to the person producing the data to be serialized to ensure that no such values are present. This should not be an issue when serializing compiled code.

For testing, we should check that round trips are possible and correct. That is, that we can serialize data, then load it back, and get back the same thing. We could also serialize data, load it, serialize it again, and check that both serialized outputs are identical. Testing should be thorough, because this is a core feature of the VM.

Contributing to the project

I'm opening this issue because I would love to have the opportunity to contribute to this project.
I haven't seen any particular mention in the README, so I don't know if contributions are welcome at this point of the project.
I'm not even sure if opening an issue on GitHub is the right thing to do, but I would be very happy to contribute in any way, from writing documentation, creating a sample language on my own, to actually working on the vm itself (which, honestly, I would actually love).

Command-line argument parsing

I think that we will eventually need to have command-line argument parsing for ZetaVM. This will be needed in order to set various parameters (eg: to disable the JIT compiler, or set the initial heap memory size). I'm kind of allergic to external dependencies, so I would rather we roll our own.

I kind of like the way V8 does it, which is a syntax of the form:

zeta --option_name=<option_value> --opt2=<val2> <program_file> -- <prog_arg0> ... <prog_argn>

For example:

zeta --nojit --init_heap_size=2048 sumthenumbers.pls -- 1 2 3

The arguments past the double-dash -- are stored in an array and passed to the running program's main function.
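
The splitting logic for this syntax is small. Here is a Python sketch of it, for discussion only (the real parser would be in C++). The function name, the returned tuple, and the choice to store valueless options such as --nojit as True are assumptions.

```python
def parse_command_line(argv):
    """Split arguments into (vm_options, program_file, program_args).

    argv excludes the executable name itself.
    """
    # Everything after the first "--" belongs to the program being run
    if "--" in argv:
        split = argv.index("--")
        vm_part, prog_args = argv[:split], argv[split + 1:]
    else:
        vm_part, prog_args = argv, []

    options = {}
    program = None
    for arg in vm_part:
        if arg.startswith("--"):
            name, sep, value = arg[2:].partition("=")
            # Options without "=<value>" are treated as boolean flags
            options[name] = value if sep else True
        elif program is None:
            program = arg
        else:
            raise ValueError("unexpected argument: " + arg)
    return options, program, prog_args


opts, prog, args = parse_command_line(
    ["--nojit", "--init_heap_size=2048", "sumthenumbers.pls", "--", "1", "2", "3"])
# opts is {"nojit": True, "init_heap_size": "2048"}
# prog is "sumthenumbers.pls" and args is ["1", "2", "3"]
```

Option values are left as strings here; converting them (for example, the heap size to an integer) would be up to whoever declares the option.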

Core dump seen on 32 bit machine

Assertion seen while running Zeta
Below is the stack trace for ./zeta --test

(gdb) r --test
Breakpoint 2, testRuntime () at vm/runtime.cpp:763
763 {
(gdb)
764 std::cout << "runtime tests" << std::endl;
(gdb)
764 std::cout << "runtime tests" << std::endl;
(gdb)
runtime tests
789 assert (arr.getElem(0) == Value::ZERO);
(gdb)
792 auto arr2 = Array(0);
(gdb)
793 assert (arr2.length() == 0);
(gdb)
794 arr2.push(Value::ONE);
(gdb)
zeta: vm/runtime.cpp:120: uint8_t* Wrapper::getNextPtr(refptr, refptr): Assertion `nextPtr != nullptr' failed.
Program received signal SIGABRT, Aborted.
0xb7fdd428 in __kernel_vsyscall ()
(gdb) p Value::ONE
$1 = {word = {float32 = 1.40129846e-45, int64 = 1, int32 = 1, int8 = 1 '\001', ptr = 0x1 <error: Cannot access memory at address 0x1>},
tag = 2 '\002', static ZERO = {word = {float32 = 0, int64 = 0, int32 = 0, int8 = 0 '\000', ptr = 0x0}, tag = 2 '\002',

Machine and compiler details are below

universe@universe-Lenovo-G50-80:~/zetavm$ uname -a
Linux universe-Lenovo-G50-80 3.13.0-135-generic #184-Ubuntu SMP Wed Oct 18 11:56:31 UTC 2017 i686 i686 i686 GNU/Linux
universe@universe-Lenovo-G50-80:~/zetavm$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/i686-linux-gnu/4.8/lto-wrapper
Target: i686-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 4.8.4-2ubuntu1~14.04.3' --with-bugurl=file:///usr/share/doc/gcc-4.8/README.Bugs --enable-languages=c,c++,java,go,d,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-4.8 --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --with-gxx-include-dir=/usr/include/c++/4.8 --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --enable-gnu-unique-object --disable-libmudflap --enable-plugin --with-system-zlib --disable-browser-plugin --enable-java-awt=gtk --enable-gtk-cairo --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-4.8-i386/jre --enable-java-home --with-jvm-root-dir=/usr/lib/jvm/java-1.5.0-gcj-4.8-i386 --with-jvm-jar-dir=/usr/lib/jvm-exports/java-1.5.0-gcj-4.8-i386 --with-arch-directory=i386 --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --enable-objc-gc --enable-targets=all --enable-multiarch --disable-werror --with-arch-32=i686 --with-multilib-list=m32,m64,mx32 --with-tune=generic --enable-checking=release --build=i686-linux-gnu --host=i686-linux-gnu --target=i686-linux-gnu
Thread model: posix
gcc version 4.8.4 (Ubuntu 4.8.4-2ubuntu1~14.04.3)

Missing math library features

I was working on code to do sound/music generation the other day, and I'm missing a few features that probably should be in our math library (plush/math.pls).

It would be useful to have functions such as floor, ceil, round and fmod, which work as in the C library. I believe it should be possible to implement these without introducing new instructions in the VM. The functions should be tested with negative, positive, zero and other integer values to make sure there are no edge cases.

EDIT: @tommyettinger points out that we also may want to add a math.isNaN(x) function, and an infinity constant, math.INFINITY.

EDIT 2017-06-21: math.INF, math.isNaN, math.floor and math.fmod are now done
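As a rough illustration of why no new VM instructions should be needed, here is a minimal C++ sketch that builds floor, ceil, round and fmod from nothing more than float-to-integer truncation, comparison and basic arithmetic. The function names are hypothetical and this is not the code in plush/math.pls. It is also not a bit-exact replacement for the C library: for example, it does not preserve negative zero, and the fmod variant loses precision for very large quotients.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper names; not the actual plush/math.pls functions.
// Doubles at or beyond 2^52 in magnitude are already integral, and NaN
// must not be cast to an integer, so both are returned unchanged.
static bool needsNoRounding(double x) {
    const double LIMIT = 4503599627370496.0; // 2^52
    return x != x || x >= LIMIT || x <= -LIMIT;
}

double floorSketch(double x) {
    if (needsNoRounding(x)) return x;
    // The integer cast truncates toward zero.
    double t = static_cast<double>(static_cast<int64_t>(x));
    return (t > x) ? t - 1.0 : t;
}

double ceilSketch(double x) {
    if (needsNoRounding(x)) return x;
    double t = static_cast<double>(static_cast<int64_t>(x));
    return (t < x) ? t + 1.0 : t;
}

// Rounds halfway cases away from zero.
double roundSketch(double x) {
    return (x >= 0.0) ? floorSketch(x + 0.5) : ceilSketch(x - 0.5);
}

// The result takes the sign of x. Requires y != 0.
double fmodSketch(double x, double y) {
    assert(y != 0.0);
    double q = x / y;
    double t = needsNoRounding(q)
        ? q
        : static_cast<double>(static_cast<int64_t>(q));
    return x - t * y;
}
```

The negative cases are the ones most likely to go wrong, since truncation rounds toward zero while floor rounds toward negative infinity.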

Implementation of a fast interpreter

The current interpreter, which directly executes the object-based Zeta IR (ZIR), is unfortunately dog slow, in large part because traversing a graph of dynamic objects can't really be made fast. I would like to start writing a new interpreter which lazily compiles the ZIR into a compact, linear, flat internal IR with variable instruction length.

This interpreter will be based on the Basic Block Versioning (BBV) compilation technique I developed, and will (eventually, not at first) do some type-specialization. Other optimizations to be added may include merging very common op sequences into macro-ops to reduce dispatch overhead, and threaded interpretation. Simply put, the interpreter will do some amount of JITting, but instead of JITting to machine code, it will JIT into its own more optimized internal IR, which we have a fast interpreter for. This could also be used to simplify the writing of an actual JIT to machine code down the line. A simple and lightweight JIT could conceivably JIT-compile the internal IR to machine code. The interpreter itself will remain maintainable and highly-portable.

Summary of desirable features:

Lazy compilation of object-based ZIR into internal flat bytecode IR (IIR)
IIR will have variable length instructions for compactness
IIR will be more optimized and have fewer branches
Lazy (re)allocation of stack frame slots for compact stack frames (after MVP)
Threaded interpretation of IIR (after MVP)
Fusion of very common ZIR op sequences into IIR macro-ops to reduce dispatch overhead (after MVP)
Elimination of dynamic type checks (later, after MVP)
Incremental inlining (for later, not at first)

The first prototype will be kept intentionally basic and simple (MVP), and will take some shortcuts, such as allocating fixed-size stack frames only. The first prototype will not be a threaded interpreter, and will not do type-specialization.

Milestone: I'm presenting at PolyConf in July 2017. I would like to have the new interpreter working before then, because the current performance level is going to turn a lot of people off. Right now, Zeta VM runs about 200K instructions per second. I would like to reach 100MIPS, maybe more. This will make it possible to use Zeta to run a wide variety of applications, including simple games.

Plan of action: I will begin laying down some code for the MVP shortly. The MVP will use BBV, but will do no type-specialization, no op fusion, and will use a switch case for dispatch. It will also use a fixed-size stack frame for simplicity. I anticipate that the MVP will still perform an order of magnitude or two better than the current interpreter.
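To make the idea of a flat, variable-length IR with switch dispatch concrete, here is a minimal C++ sketch. The opcodes, the byte encoding and the function names are all assumptions made for illustration; they are not Zeta's instruction set, and the sketch leaves out BBV and lazy compilation entirely.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical opcodes for illustration; not Zeta's actual instruction set.
enum Op : uint8_t { OP_PUSH, OP_ADD, OP_MUL, OP_RET };

// Append a one-byte instruction with no operand.
void emitOp(std::vector<uint8_t>& code, Op op) {
    code.push_back(op);
}

// Append OP_PUSH followed by a 4-byte immediate (5 bytes in total).
void emitPush(std::vector<uint8_t>& code, int32_t val) {
    code.push_back(OP_PUSH);
    uint8_t bytes[sizeof(val)];
    std::memcpy(bytes, &val, sizeof(val));
    code.insert(code.end(), bytes, bytes + sizeof(val));
}

// Switch-based dispatch loop over the flat byte buffer.
int32_t runFlat(const std::vector<uint8_t>& code) {
    int32_t stack[64]; // fixed-size frame, as in the MVP plan
    int sp = 0;
    size_t pc = 0;
    for (;;) {
        assert(pc < code.size());
        switch (code[pc++]) {
        case OP_PUSH: {
            int32_t val;
            assert(pc + sizeof(val) <= code.size());
            std::memcpy(&val, &code[pc], sizeof(val));
            pc += sizeof(val);
            assert(sp < 64);
            stack[sp++] = val;
            break;
        }
        case OP_ADD:
            assert(sp >= 2);
            --sp;
            stack[sp - 1] += stack[sp];
            break;
        case OP_MUL:
            assert(sp >= 2);
            --sp;
            stack[sp - 1] *= stack[sp];
            break;
        case OP_RET:
            assert(sp >= 1);
            return stack[sp - 1];
        default:
            assert(false && "unknown opcode");
            return 0;
        }
    }
}

// Encodes and runs (10 + 20) * 3.
int32_t flatExample() {
    std::vector<uint8_t> code;
    emitPush(code, 10);
    emitPush(code, 20);
    emitOp(code, OP_ADD);
    emitPush(code, 3);
    emitOp(code, OP_MUL);
    emitOp(code, OP_RET);
    return runFlat(code);
}
```

Here operand-free instructions take one byte and pushes take five, which is the kind of compactness a variable-length encoding buys over fixed-width instructions.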

Stack traces on exceptions & assert

Currently, every call site has an src_pos attribute which we can report in the stack trace.

@krypt-n pointed out that we may want to make sure to set the name attribute on every function, as much as possible. This will have to be handled at the parser level.

Another issue we need to discuss is that currently, assert maps to the abort instruction in the VM, which halts program execution. We have to decide whether we want to make assert simply throw an exception as in Python or keep the current behavior. The current behavior is based on the idea that assert is for circumstances which we really expect to be impossible, for sanity checks, whereas exceptions are for errors which we anticipate people could trigger.

If we make assert simply map to exceptions, things may be simpler, implementation-wise, because we wouldn't have to add stack traces to abort. However, if we keep the abort instruction, we probably want stack traces anyway, so there is also a discussion to be had as to whether that instruction should be kept or not.

Floating-point support in the self-hosted Plush parser

We now have floating-point support in the VM, and in the C++ Plush compiler (cplush). However, we're still missing support for parsing floats in the self-hosted Plush parser (plush/parser.pls). This is a big missing piece, because the self-hosted parser is the preferred way to write Plush code. It's likely that eventually, the cplush compiler will be deprecated. At the very least, this parser won't be as updated and tested as the self-hosted one.

Adding support for floats is mostly a matter of implementing parsing for float literals in parser.pls. The str_to_float instruction can be used to produce floating-point values from strings. There also needs to be a FloatExpr node type added, with accompanying codegen.

Interpreter performance improvements

Hello,

The code at https://github.com/maximecb/zetavm/blob/master/vm/interp.cpp#L375-L943 dispatches bytecodes with a switch, which can be a big performance bottleneck, especially when dispatching a large number of bytecodes. This can be improved by threaded dispatch via computed gotos.
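For readers unfamiliar with the technique, here is a minimal sketch of computed-goto dispatch. It relies on the "labels as values" extension, so it builds with GCC and Clang but is not standard C++. The opcodes and the function names are hypothetical and unrelated to the code in interp.cpp, and the loop does no bounds checking. Whether this actually beats a switch depends on the compiler and CPU, so it would need to be measured.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical opcodes; operands are stored inline after OP_PUSH.
enum Op : int32_t { OP_PUSH, OP_ADD, OP_MUL, OP_RET };

int32_t runThreaded(const std::vector<int32_t>& code) {
    // One label address per opcode, indexed by opcode value.
    static void* const labels[] = { &&do_push, &&do_add, &&do_mul, &&do_ret };
    int32_t stack[64];
    int sp = 0;
    size_t pc = 0;
    assert(!code.empty());

// Each handler jumps straight to the next handler, so there is no
// central loop and no shared switch branch.
#define DISPATCH() goto *labels[code[pc++]]

    DISPATCH();

do_push:
    stack[sp++] = code[pc++];
    DISPATCH();
do_add:
    --sp;
    stack[sp - 1] += stack[sp];
    DISPATCH();
do_mul:
    --sp;
    stack[sp - 1] *= stack[sp];
    DISPATCH();
do_ret:
    return stack[sp - 1];

#undef DISPATCH
}

// Runs (10 + 20) * 3.
int32_t threadedExample() {
    return runThreaded({OP_PUSH, 10, OP_PUSH, 20, OP_ADD,
                        OP_PUSH, 3, OP_MUL, OP_RET});
}
```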

Also, https://github.com/maximecb/zetavm/blob/master/vm/interp.cpp#L155-L248 might benefit from using a map<K,V> rather than an if-else chain, as it is referenced from inside the main VM loop.

I am happy to raise a pull request if you say! 😄

Global & local package names, feedback wanted

Soliciting opinions & feedback regarding this issue. This is a problem that we have to address before the zeta package manager goes online, and preferably should be solved early on.

Currently, when you import a package in Zeta, there's a non-trivial amount of logic going on in packages.cpp: https://github.com/zetavm/zetavm/blob/master/vm/packages.cpp#L496
https://github.com/zetavm/zetavm/blob/master/vm/packages.cpp#L556

I have two regexes to validate the package path format in there. I want to force package paths to be of the form "foo/bar/bif/N", where N is a version number. One of the issues there is that people may want to import local files as packages. This is in conflict with my desire to standardize the paths of packages in the packages directory, the "standard" packages that come with the VM or will be managed by the package manager.

I'm starting to think that probably, what we need is a different package name syntax for global/non-local packages. Those being the core packages, what's under packages, those that will be managed by the VM and package manager.

I was thinking that we could force global package names to begin with a colon character, like this:

var io = import ":core/io/0";

This would be in contrast to local package names, which can be any local file path:

var myPkg = import "/user/foobar/../some_unix_path.pls";

Having a separate format for non-local paths will simplify the path validation logic, and it might have some security benefits. That is, it's more difficult to accidentally import a local package when you wanted to import a global one, and vice versa.

It is technically possible on unix/linux to create a path or file name with a colon in it, but with this syntax, any package with a name starting with ":" will be looked up as a global package. To look up a local module with a colon, you would do:

var myModuleWithAWeirdName = import "./:colonFileName.pls";
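A minimal sketch of how the proposed rule could simplify validation, assuming a leading colon selects the global format and everything else is treated as a local path. The regex, the enum and the function name are assumptions for illustration; they are not the regexes currently in packages.cpp.

```cpp
#include <regex>
#include <string>

enum class PkgKind { Global, Local, Invalid };

// Hypothetical classifier for import strings under the proposed syntax.
PkgKind classifyImport(const std::string& name) {
    // Assumed global format: ":foo/bar/N", where N is a version number.
    static const std::regex globalRe("^:[a-z0-9_]+(/[a-z0-9_]+)*/[0-9]+$");
    if (name.empty())
        return PkgKind::Invalid;
    // A leading colon always means "global", so a malformed global name
    // is rejected rather than silently retried as a local path.
    if (name[0] == ':')
        return std::regex_match(name, globalRe) ? PkgKind::Global
                                                : PkgKind::Invalid;
    // Anything else is left to the file system to resolve.
    return PkgKind::Local;
}
```

With this split, only global names need strict format validation, and "./:colonFileName.pls" is classified as local because it does not start with a colon.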

Memory leak?

Hi all,
This is the 360 CodeSafe Team. We found a suspected memory leak at

zetavm/plush/codegen.cpp

Line 1005 in 61af9cd

auto joinBlock = new Block();

auto joinBlock = new Block(); allocates a Block and hands its address to the ctx parameter through ctx.merge(joinBlock);. There is no delete in genStmt() (the callee side), and none on the caller side either, see https://github.com/zetavm/zetavm/blob/master/plush/codegen.cpp#L824 . CodeGenCtx also has no destructor. So we have reason to believe that the memory allocated at

zetavm/plush/codegen.cpp

Line 1005 in 61af9cd

auto joinBlock = new Block();

may leak.
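If the leak is confirmed and considered worth fixing, one common approach is to give the blocks a single owner. The sketch below assumes the context could own every block it creates; Block and CodeGenCtxSketch here are hypothetical stand-ins, not the real types in plush/codegen.cpp, and the real code may have lifetime requirements that this does not capture.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for the real Block type.
struct Block {
    std::vector<int> instrs;
};

// Hypothetical context that owns every block it hands out.
class CodeGenCtxSketch {
public:
    // Allocates a block owned by the context.
    // The returned pointer is non-owning and must not be deleted.
    Block* newBlock() {
        blocks.push_back(std::unique_ptr<Block>(new Block()));
        return blocks.back().get();
    }

    size_t numBlocks() const { return blocks.size(); }

private:
    // Every block is freed automatically when the context is destroyed.
    std::vector<std::unique_ptr<Block>> blocks;
};
```

A call site would then write auto joinBlock = ctx.newBlock(); instead of new Block(), and no explicit delete would be needed.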

Since I'm unfamiliar with zetavm, please forgive me if there is anything wrong with my description.

Qihoo360 CodeSafe Team

Implementation of a LISP-like language front-end

Title says it all. I would like to implement a lisp-like language front-end for Zeta. I was thinking that the LISP dialect supported could be named "miniLISP". Another possibility would be to go for full-on scheme support.

String escaping function

I was writing code to generate CSV files yesterday, and it occurred to me that it would be useful to have a string escaping function as part of the std/string library (see plush/string.pls). I propose we simply name this function "escape". It should follow the C string escaping rules, and could be a direct port of the escapeStr function found in vm/serialize.cpp.
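As a rough idea of what such a function does, here is a minimal C++ sketch of C-style string escaping. It is not a port of escapeStr from vm/serialize.cpp, whose exact behavior may differ; the name escapeSketch and the choice of three-digit octal escapes for other control characters are assumptions. Octal is used because it is fixed-width, so the escape cannot absorb the characters that follow it.

```cpp
#include <string>

// Hypothetical sketch; not a port of escapeStr in vm/serialize.cpp.
std::string escapeSketch(const std::string& in) {
    std::string out;
    for (unsigned char ch : in) {
        switch (ch) {
        case '\\': out += "\\\\"; break;
        case '"':  out += "\\\""; break;
        case '\n': out += "\\n";  break;
        case '\r': out += "\\r";  break;
        case '\t': out += "\\t";  break;
        default:
            if (ch < 0x20 || ch == 0x7f) {
                // Other control characters become a 3-digit octal escape.
                out += '\\';
                out += static_cast<char>('0' + (ch >> 6));
                out += static_cast<char>('0' + ((ch >> 3) & 7));
                out += static_cast<char>('0' + (ch & 7));
            } else {
                // Printable ASCII and UTF-8 bytes pass through unchanged.
                out += static_cast<char>(ch);
            }
        }
    }
    return out;
}
```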

Overflow in arithmetic causes Zeta to crash

You can test this easily by adding the test line:

assert (65537 * 65535 == -1);

To any test you want, like simple_exprs.pls. For me on Windows 7, this gives a rather useless standard-Windows-OS error with this set of details:

Problem signature:
  Problem Event Name:	APPCRASH
  Application Name:	zeta.exe
  Application Version:	0.0.0.0
  Application Timestamp:	594b67d5
  Fault Module Name:	zeta.exe
  Fault Module Version:	0.0.0.0
  Fault Module Timestamp:	594b67d5
  Exception Code:	40000015
  Exception Offset:	0000000000030888
  OS Version:	6.1.7601.2.1.0.256.1
  Locale ID:	1033
  Additional Information 1:	e270
  Additional Information 2:	e27047525f7bbc7539be3fd841a27205
  Additional Information 3:	5310
  Additional Information 4:	531031f3f1e05e5a47a30dcb69debe58

I'm not sure precisely why this is, but I suspect the Word union didn't get fully updated to use 32-bit math while it allows 64 bits internally, and perhaps overflowing is setting bits beyond the initial 32.

EDIT: Overflow may not be the correct term, since the example given should still fit within 32 bits and it still crashes. It seems related to wrapping around from positive to negative. This is not tied to multiplication in particular; the code

assert (2147483647 + 2147483647 <= 0);

Will also produce the same type of crash, and it also fits in 32 bits (but should wrap to be a negative number).

This is currently stopping any progress on the RNG code, where almost everything involves overflow. I'll help however I can on this, but I don't know any of the C/C++ intricacies in play here (I just learned that there are more than 2 kinds of cast in C++ today...).
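Independently of what causes the crash, the wrap-around behavior that the two test lines expect can be written in C++ without relying on signed overflow, which is undefined behavior. The sketch below does the arithmetic on unsigned 32-bit values, which wrap by definition, and converts back. The function names are hypothetical and this is not taken from the Zeta sources. The final conversion to int32_t is implementation-defined before C++20, although mainstream compilers wrap as expected.

```cpp
#include <cstdint>

// Two's-complement wrapping addition on 32-bit integers.
int32_t wrapAdd32(int32_t a, int32_t b) {
    return static_cast<int32_t>(
        static_cast<uint32_t>(a) + static_cast<uint32_t>(b));
}

// Two's-complement wrapping multiplication on 32-bit integers.
int32_t wrapMul32(int32_t a, int32_t b) {
    return static_cast<int32_t>(
        static_cast<uint32_t>(a) * static_cast<uint32_t>(b));
}
```

With these helpers, 65537 * 65535 equals 2^32 - 1 and wraps to -1, and 2147483647 + 2147483647 equals 2^32 - 2 and wraps to -2.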

Core VM identifier limitations

While working on the Scheme implementation I discovered that the core VM has a somewhat limited set of legal identifiers (sequences of alphanumeric characters and underscores). Scheme, OTOH, is quite permissive (interestingly, scanning is more complicated than parsing). For example, many common functions end with '?' like 'boolean?'.

In order to support multiple languages properly, Zeta will need to take the union of legal identifiers across all supported languages.

Garbage Collector implementation

TL;DR This is my attempt at implementing a garbage collector for ZetaVM. If you have any suggestions, please comment (especially if you think I'm doing something wrong).

In the next few weeks I'll try to implement a first version of a garbage collector for ZetaVM.
Right now the VM is missing one, so it is relatively crucial to have one. I can't promise an incredibly good garbage collector, mainly because this is also my first one. It will be a very simple mark and sweep collector, non-moving and stop-the-world.

(By Values with allocated memory I mean the Values that are created with this function)

The main points of my plan are:

Implement a function (let's call this function get_stack_values for now) that returns all the Values with allocated memory on the stack at a given moment
Implement a LinkedList which will contain all the Values with allocated memory
Implement the allocated function for the VM class.
Add a check in the alloc function (in the VM class) that tests whether the amount of allocated memory exceeds a threshold value and, if so, starts the garbage collector function.
Implement the garbage collector function. It will first call get_stack_values and mark the returned Values and their children recursively. It will then traverse the aforementioned LinkedList and sweep (free the memory of) the Values that have not been marked. It will then set the threshold value to double the size of the currently allocated memory.

Tricky stuff that I should take care of (Please suggest more):

Strings: they are in a hashmap contained in the StringPool class, so I should also delete the entry from there
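The plan above can be sketched in a few dozen lines of C++. This is a simplified model under stated assumptions, not ZetaVM code: Obj stands in for a heap-allocated Value, the roots are passed in explicitly instead of coming from get_stack_values, the threshold counts objects rather than bytes, and the string pool is ignored.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical heap object; stands in for a Value with allocated memory.
struct Obj {
    bool marked = false;
    std::vector<Obj*> children; // references to other heap objects
    Obj* next = nullptr;        // intrusive linked list of all objects
};

class HeapSketch {
public:
    ~HeapSketch() {
        while (head) { Obj* n = head->next; delete head; head = n; }
    }

    // Allocates an object, collecting first if the threshold is reached.
    Obj* alloc(const std::vector<Obj*>& roots) {
        if (numObjs >= threshold)
            collect(roots);
        Obj* o = new Obj();
        o->next = head;
        head = o;
        ++numObjs;
        return o;
    }

    // Stop-the-world, non-moving mark and sweep.
    void collect(const std::vector<Obj*>& roots) {
        for (Obj* r : roots)
            mark(r);
        Obj** link = &head;
        while (*link) {
            Obj* o = *link;
            if (o->marked) {
                o->marked = false; // reset for the next collection
                link = &o->next;
            } else {
                *link = o->next;   // unlink and free unreachable object
                delete o;
                --numObjs;
            }
        }
        // Next collection triggers at double the surviving object count.
        threshold = (numObjs > 4) ? 2 * numObjs : 8;
    }

    size_t liveCount() const { return numObjs; }

private:
    // Recursive marking; the marked flag stops cycles from looping.
    static void mark(Obj* o) {
        if (!o || o->marked) return;
        o->marked = true;
        for (Obj* c : o->children)
            mark(c);
    }

    Obj* head = nullptr;
    size_t numObjs = 0;
    size_t threshold = 8;
};
```

One thing the sketch makes visible is that recursive marking can overflow the native stack on deep object graphs, so an explicit worklist may be worth considering.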

Implement "A Little Smalltalk"

I'd like to explore creating a version of Tim Budd's "A Little Smalltalk" on top of ZetaVM, perhaps as a precursor to a more fully featured Smalltalk.

Understanding how plush works

So I'm trying to add float support to plush mainly to be able to test floats in a higher level language.
Now, I'm not fully sure about how the front end for plush works.
My main point of confusion is the fact that the parser has some parts written in cpp and others in plush.
Moreover, the runtime is written in plush but the code generation in cpp.

So runtime.pls is interpreted first, and it seems that it adds support for overloaded symbols, such as +

The code generation in cpp produces .zim code and when it encounters an overloaded symbol it uses functions defined in runtime

What about parser.pls? Is it ever used?

Finding global variables (GC implementation)

In the process of trying to make a very simple GC, or at least starting trying out things for it, I wanted to print all the declared variables when VM::alloc is called.

I've managed to print the variables that are on the stack with this function:

void getStackVariables()
{
    for (Value* ptr = stackPtr; ptr != stackBase; ptr++)
    {
        if (ptr[0].isPointer())
        {
            std::cout << ptr[0].toString() << std::endl;
            if (ptr[0].isObject())
            {
                for (auto itr = ObjFieldItr(ptr[0]); itr.valid(); itr.next())
                {
                    auto fieldName = itr.get();
                    std::cout << "field: " << fieldName << std::endl;
                }
            }
            else if (ptr[0].isString())
            {
                std::cout << "string: " << (std::string)ptr[0] << std::endl;
            }
        }
    }
}

This function is inside interp.cpp

What I'm doing is just traversing the stack and printing the Values that are pointers (the Values I'm interested in garbage collecting). This works fine, but I can't find global variables when this function is called while the interpreter is executing a function.

I understand that global variables in Plush are fields of the @global_object, but I can't seem to find where this object is stored during execution. What I want to do is find all global variables that are active when the function I've written is called, and check if they are pointers. If they are, it will print them.

I'm sorry if the description happens not to be clear :/

An instruction to invoke opcodes on the stack

I had an idea that I've implemented (basically).
It is an opcode that uses the value from the top of stack as the opcode which is then executed.

Ex:
{ op: "push", val: 10 },
{ op: "push", val: 20 },
{ op: "push", val: 6 }, #ADD_I32 opcode
{ op: "invoke"},
{ op: "push", val: 3 },
{ op: "push", val: 8 }, #MUL_I32 opcode
{ op: "invoke"},
{ op: "ret"},

"invoke" is the actual instruction. It "becomes" ADD_I32 in the first case and MUL_I32 in the second. So the preceding code returns 90, that is, (10+20)*3.
This could be useful to implement "eval" like functions in languages that support it. Or allow to compile to a form of "dynamic code".
As of now only instructions that don't need arguments can be called this way, and there seems to be an issue with the "JUMP_STUB" opcode (it tries to update the opcode in memory). So not all opcodes make sense invoking dynamically.
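To illustrate the mechanism outside of the VM, here is a minimal C++ sketch of a stack interpreter with such an instruction. The opcode numbering, the Instr struct and the function names are hypothetical and do not match Zeta's opcodes (where ADD_I32 is 6 and MUL_I32 is 8 in the example above). As in the description, only argument-free opcodes may be invoked dynamically.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical opcode numbering, for illustration only.
enum Op : int32_t { OP_PUSH, OP_ADD_I32, OP_MUL_I32, OP_INVOKE, OP_RET };

struct Instr {
    Op op;
    int32_t val; // only used by OP_PUSH
};

static int32_t pop(std::vector<int32_t>& stack) {
    if (stack.empty()) throw std::runtime_error("stack underflow");
    int32_t v = stack.back();
    stack.pop_back();
    return v;
}

int32_t runWithInvoke(const std::vector<Instr>& code) {
    std::vector<int32_t> stack;
    for (size_t pc = 0; pc < code.size(); ++pc) {
        int32_t op = code[pc].op;
        if (op == OP_INVOKE) {
            // The opcode to execute is taken from the top of the stack.
            op = pop(stack);
            // Only opcodes without inline arguments make sense here.
            if (op != OP_ADD_I32 && op != OP_MUL_I32)
                throw std::runtime_error("opcode cannot be invoked");
        }
        switch (op) {
        case OP_PUSH:
            stack.push_back(code[pc].val);
            break;
        case OP_ADD_I32: {
            int32_t b = pop(stack);
            int32_t a = pop(stack);
            stack.push_back(a + b);
            break;
        }
        case OP_MUL_I32: {
            int32_t b = pop(stack);
            int32_t a = pop(stack);
            stack.push_back(a * b);
            break;
        }
        case OP_RET:
            return pop(stack);
        default:
            throw std::runtime_error("unknown opcode");
        }
    }
    throw std::runtime_error("missing ret");
}

// Mirrors the example above: (10 + 20) * 3.
int32_t invokeExample() {
    const std::vector<Instr> code = {
        {OP_PUSH, 10}, {OP_PUSH, 20}, {OP_PUSH, OP_ADD_I32}, {OP_INVOKE, 0},
        {OP_PUSH, 3},  {OP_PUSH, OP_MUL_I32}, {OP_INVOKE, 0}, {OP_RET, 0},
    };
    return runWithInvoke(code);
}
```

The explicit whitelist is the point worth noting: because the opcode comes from runtime data, the interpreter has to validate it before dispatching.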

Anyways.

Array library (std/array)

In a similar vein to #73, we should create a library of useful array functions. These functions should, where it makes sense, use the same naming and argument conventions as the string utility library.

This library should be written in Plush, and the package should go under std/array/0.

Functions we probably want to see:

Suggestions for functions to add or on how to better organize things are welcome.

C++ Plush compiler (cplush) deprecation

Some of the language keywords require whitespace after them. E.g. return 42; should be valid but return42; should not be. Currently the Plush parser does not check for this, which allows you to write code like

varmyvar = 10;
print(myvar);

var throwAwayTheArgument = function(f)
{
  return42;
};

print(throwAwayTheArgument(0));
throwAwayTheArgument(12);

It also throws an error when a function name starts with any of the keywords, like throw, return, var etc. Here, in the line throwAwayTheArgument(12);, the parser parses it as throw AwayTheArgument(12);, causing the error ERROR: get_field failed, missing field "AwayTheArgument".
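Both symptoms would go away if a keyword only matched when the character after it cannot continue an identifier. Here is a minimal C++ sketch of that check. The function names are hypothetical and this is not the actual Plush parser API.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// True for characters that can appear inside an identifier.
static bool isIdentChar(char ch) {
    return std::isalnum(static_cast<unsigned char>(ch)) || ch == '_';
}

// Hypothetical helper: returns true only if kw occurs at pos in src
// and is not immediately followed by an identifier character.
bool matchKeyword(const std::string& src, size_t pos, const std::string& kw) {
    if (pos > src.size())
        return false;
    if (src.compare(pos, kw.size(), kw) != 0)
        return false;
    size_t end = pos + kw.size();
    return end >= src.size() || !isIdentChar(src[end]);
}
```

With this rule, return 42; and return; still match the return keyword, while return42;, varmyvar and throwAwayTheArgument are left to be parsed as ordinary identifiers.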

What about debugging?

Kinda really important to be able to debug a language once it's done, isn't that a feature of the VM that needs to exist?
Would be cool if you could just use the same remote debugging protocol that v8 has, i think it's called webkit remote debugging protocol

also what's the status right now, can I make a language on top of zetavm right now, and how?

How can lines in image files be mapped to lines in source files?

If an error occurs somewhere in a running program, is it possible to get the corresponding line number in the source file? Are related debugging features implemented in Zeta?

JavaScript transpilers and browsers use a similar system called source maps, and GCC/GDB also use a similar system with the -g option.

Thanks.