database-stream-processor-compiler's Issues

[RFC] Interacting with DDlog programs

A DDlog program is compiled into a library. The task of sending and receiving data from DDlog programs is left to applications built on top.

Interaction with DDlog programs happens either through the CLI or through API calls. The CLI operations are in fact interpreted by the CLI tool and converted into API calls. The semantics of these API calls are currently not well specified.

Here are things that should be clearly specified (a rough sketch of the API surface in question follows this list):

  • which API calls can fail and when. For example, inserting a row into a table with a primary key that would violate the primary key constraint can fail. What other operations can fail?
  • what is the state of the DDlog computation after a failure? The interaction with DDlog is performed by executing multiple commands. If a failure occurs, which commands have affected the DDlog runtime state and which haven't? Does committing a transaction after a failure cause any insertions or deletions to occur? Should a failure imply an automatic rollback?
  • The spec indicates that a transaction commit call can only fail if there is no transaction in progress. Is there any other reason a transaction commit can fail? If yes, what is the state of the system after the failure? Same questions apply to rollback.
  • When can indexes be safely read? Can indexes be read while a transaction is in progress? Is there a race between index access and transaction commits performed by a separate thread?
  • Does the order of operations within a set of commands matter? Does a set of commands performing a deletion followed by an insertion of the same record mean the same thing as issuing no command at all, or does the result depend on the relation contents?
  • Some API calls must be invoked only within a transaction; for others it is not clear whether they can be invoked at any time.
  • I suspect that the order of API calls performed between the start and end of a transaction matters. The spec indicates that updates are buffered and applied at commit time. Are API calls such as ddlog_clear_relation also buffered?
  • There seems to be no way to report back why a certain apply_commands call has failed. Should some error metadata be returned on error?
  • Are operations like deletion of a missing record or insertion of an existing record in a relation errors or no-ops? Related to this question, perhaps there should be several kinds of input relations: some which apply distinct automatically to input relations, some which assume that the external operations always preserve the relation as a table (and which may have undefined results if this protocol is not obeyed by the environment), and some which return errors when the relation would become inconsistent.
  • Must commands such as ddlog_dump_input_snapshot be run within a transaction? If not, how do they interact with concurrent transactions that may be in progress?
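For concreteness, here is a rough sketch of the API surface these questions are about, written as a Rust trait. It is illustrative only: the names follow the DDlog-1 style used above, the placeholder types are hypothetical, and pinning down the exact semantics is precisely what this RFC asks for.

// Illustrative sketch only; `Error`, `Command`, `RelId`, `IdxId`,
// `Record`, and `Snapshot` are hypothetical placeholder types.
struct Error; struct Command; struct RelId; struct IdxId; struct Record; struct Snapshot;

trait DDlogProgram {
    // Which of these can fail, and in what state do they leave the runtime?
    fn transaction_start(&mut self) -> Result<(), Error>;
    fn transaction_commit(&mut self) -> Result<(), Error>;
    fn transaction_rollback(&mut self) -> Result<(), Error>;

    // Updates are buffered and applied at commit time; does ordering within
    // a batch matter, and are calls like `clear_relation` buffered too?
    fn apply_commands(&mut self, cmds: &[Command]) -> Result<(), Error>;
    fn clear_relation(&mut self, rel: RelId) -> Result<(), Error>;

    // Must these run inside a transaction? Can they race with a commit
    // performed by a separate thread?
    fn dump_input_snapshot(&self) -> Result<Snapshot, Error>;
    fn query_index(&self, idx: IdxId, key: &Record) -> Result<Vec<Record>, Error>;
}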

[RFC] Traits

No design proposal yet. Just collecting pros and cons of adding support for traits (or some other form of generics) to the language.

Why supporting generics (traits) is a good idea

  • Avoid Rust compilation errors. Since we don't have a way to specify trait bounds, extern functions that require such bounds must be declared without trait bounds in DDlog. Calling such a function with an argument that does not satisfy the bounds causes Rust compilation errors (see the sketch after this list).
  • Cleaner way to implement string conversions
  • Write more logic in DDlog without having to resort to Rust.
  • Avoid blanket trait implementations for all types in the program, e.g., applications that do not require serialization can skip implementing serde traits, while those that do can implement Serialize/Deserialize for only a subset of types. The same goes for FromRecord, ToRecord, and Mutator.
  • Generics increase the expressive power of relation transformers and enable reusable transformer libraries by allowing transformers to be generic over argument and return types. In the current implementation of transformers in DDlog-1 we achieve this by requiring the caller to provide "accessor" functions with input relations. Traits are a cleaner way to achieve the same.
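To illustrate the first bullet above, the hypothetical declaration below shows what an extern function with a trait bound could look like; with such a bound, the DDlog compiler could reject a bad call site itself instead of surfacing a rustc error after code generation. The syntax is made up for illustration; there is no design proposal yet.

// Hypothetical DDlog-2 syntax (no such feature exists yet). Without the
// `T: ToString` bound, calling this function with a type that does not
// implement `ToString` fails only later, inside rustc.
extern function to_string<T: ToString>(x: T): string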

Why traits are a bad idea

  • Design and implementation complexity.
  • Not exactly a downside, but something to consider: much of the power of traits in Rust comes from the auto-derive mechanism. Since we are not realistically going to support macros (certainly not procedural macros) any time soon, we will need a way to delegate trait implementation to Rust, e.g., by having a #[derive] attribute that will tell DDlog that the type implements the trait, but will delegate the actual implementation to Rust.
  • The previous item will make the interpreter story more complicated, as we will need to expose trait implementations written in Rust to the interpreter.

[RFC] DDlog Packages and Rust integration

This RFC proposes a systematic way to organize DDlog code into packages and map
these packages into Rust crates. As part of this, we propose a design for
integrating native Rust code into a DDlog project.

Motivation

We address the following limitations of DDlog-1:

  • DDlog-1 does not have the notion of a project or package. Starting with the
    user-specified top module, the compiler finds all its transitive dependencies and
    generates a Rust project by automatically partitioning these dependencies into
    crates based on heuristics. This approach has proved flawed in multiple ways.

    • Suboptimal crate structure (with too many or too few crates).
    • Configuration options, e.g., library paths, must be specified as CLI arguments
      to the DDlog compiler.
    • The lack of a project metadata file complicates IDE integration.
    • It is near-impossible to write native Rust code that depends on definitions
      from other modules, as these modules may end up in arbitrary crates.
    • It is hard to distribute DDlog libraries.
    • Managing dependencies on extern Rust crates is tricky. One must make sure
      that at most one DDlog module imports each Rust crate in its module.toml
      file.
  • Native Rust code is not self-contained. A .dl module can be accompanied by
    a .rs file that implements extern functions and types declared in this module.
    The contents of the .rs file are concatenated with DDlog-generated Rust code and
    copied to the generated Rust project. This leads to poor ergonomics, as
    developers cannot use standard Rust development tools to write these files.
    Moreover, Rust compiler errors point to locations in the generated file that must be
    manually mapped back to the original Rust code.

Packages

DDlog-2 code is organized in packages. Like a Rust crate, a package
consists of a tree of modules and a metadata file specifying package
dependencies. There are two types of modules: DDlog modules and native
Rust modules. The DDlog-2 compiler converts the package into a Rust crate
by generating Cargo.toml for the package and a Rust module module.rs for
each DDlog module module.dl. Native Rust modules are included in the Rust
project as is, and in place, so that Rust compiler messages point to actual
source code locations.

Package structure

A DDlog package looks a lot like a Rust crate. The package.toml file in the
root directory contains package metadata: name, version, description, etc., path
to the main module (e.g., lib.dl), and dependencies. A package can have two
kinds of dependencies: Rust crates and other DDlog packages. The former can
point to crates.io, a git repository, or a local folder; the latter can initially only
point to a local folder or a git repo, but it should be possible to implement
support for crates.io in the future (see below).

my_package/
├── package.toml
└── src/
    ├── lib.dl
    ├── mod1.dl
    ├── mod2.dl
    ├── mod2.rs
    └── mod3/
        └── mod.dl

A module can consist of a single file or a file tree. In the above example,
mod1.dl is a single-file DDlog module, and mod2.rs is a single-file Rust
module. mod2.dl contains DDlog bindings for Rust definitions
in mod2.rs. Such a file may contain only function prototypes without
implementations, and type declarations (see the discussion of extern types below).
Ideally, this file should be generated automatically from mod2.rs, but we may
want to leave this for future work.

Similar to lib and bin crates in Rust, we may want to distinguish library
packages and executable packages, where only the latter can be used to
instantiate a dataflow.

Generated code structure

In contrast to DDlog-1, we place the generated Rust code under the package directory.
We generate Cargo.toml in the top-level folder. Each .dl module is compiled to a
Rust module and stored in the src_rs directory that mirrors the module structure
of the DDlog package. Native Rust modules remain unmodified at their original location.
Rust's #[path] attribute is used to link the native modules to the generated
Rust project:

my_package/
├── Cargo.toml
├── package.toml
├── src/
│   ├── lib.dl
│   ├── mod1.dl
│   ├── mod2.dl
│   ├── mod2.rs
│   └── mod3/
│       └── mod.dl
└── src_rs/
    ├── lib.rs
    ├── mod1.rs
    └── mod3/
        └── mod.rs

Native types and functions

As discussed above, a DDlog package can contain native modules implemented in Rust.
A native module is accompanied by a .dl file that declares DDlog bindings for
types, functions, and trait implementations exported by the Rust module, e.g.,

/// Function signature (no implementation).
pub fn f<T: Ord>(arg: T) -> bool;

/// `impl` block consisting of function signatures only.
impl MyStruct {
    fn f1(self) -> bool;
}

/// Trait `impl` without a body.
impl Ord for MyStruct;

/// OpaqueType can be declared in Rust as a struct, enum, or alias.
type OpaqueType;

We sometimes want to expose Rust types like Option and Result to DDlog not as
opaque types but as structs or enums with constructors. In DDlog-1 we did so by
re-declaring the types in DDlog and implementing conversion functions to/from
std and ddlog_std versions of the type. This did not exactly improve Rust
developers' experience.

We therefore propose that in DDlog-2 one can use Rust structs and enums directly
by describing the structure of the struct or enum to the DDlog compiler. Consider
the following native Rust module and accompanying DDlog binding that expose the
Rust Option type to DDlog:

/// option.rs

// Re-export Rust `Option` type to DDlog
pub use std::option::Option;
/// option.dl

// Instead of declaring `Option` as an opaque type, tell DDlog about its constructors.
// DDlog knows that `module.dl` contains bindings for native Rust code in `module.rs` and
// will not generate a duplicate Rust definition for this type.
pub enum Option<T> {
    None,
    Some(T)
}

Distributing DDlog packages via crates.io

One advantage of the proposed design is that DDlog packages map directly to Rust crates and
can be distributed as such. This has dual benefits: DDlog developers can use the crates
ecosystem for software distribution and dependency management; conversely, Rust developers
can easily incorporate DDlog libraries in their programs. Since crates.io doesn't support the
DDlog package file format, one must include the generated Cargo.toml file in the distribution.
The next question is whether we need to distribute generated Rust sources along with .dl
files. This is probably undesirable and can be avoided by using a build.rs that invokes
the DDlog compiler to convert DDlog sources to Rust at build time.
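A minimal sketch of what such a build.rs could look like, assuming a hypothetical ddlog executable that regenerates src_rs/ from the DDlog sources (the actual compiler invocation and its flags are not specified here):

// build.rs -- sketch only; the `ddlog` CLI and its flags are assumptions.
use std::process::Command;

fn main() {
    // Re-run the DDlog compiler whenever the DDlog sources change.
    println!("cargo:rerun-if-changed=src");
    println!("cargo:rerun-if-changed=package.toml");

    // Regenerate `src_rs/` from `src/` before Cargo compiles the crate.
    let status = Command::new("ddlog")
        .args(["compile", "--package", "."])
        .status()
        .expect("failed to invoke the DDlog compiler");
    assert!(status.success(), "DDlog-to-Rust compilation failed");
}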

[RFC] Issues and limitations in DDlog-1.

This is a laundry list of issues and limitations in the first version of DDlog. We cannot realistically address all of them, but as we are starting a new iteration of the project, this is a good time to brainstorm our options and incorporate solutions to some of the problems in the design of DDlog-2.

[The plan is to write a paragraph for each bullet point below and use separate RFCs for detailed discussions]

Language ergonomics

  • Confusing syntax and semantics of group_by (addressed by the design proposed in #2)

    • group_by currently combines projection and grouping operations. This is the only time an explicit projection is required in DDlog, as in all other situations the compiler automatically keeps the subset of variables used in the remainder of the rule and projects away everything else. Projection is easy to get wrong. Put differently, the programmer does not normally need to explicitly define the output relation produced by each DDlog operator. The compiler maintains that relation automatically as a tuple of all variables defined in the prefix of the rule and used anywhere in its suffix. In the case of group_by, the input relation must be constructed explicitly in order to achieve the desired behavior.
    • Variables that are not part of the group key become unavailable after group_by
  • It is not possible to mix imperative and relational code (addressed by the design proposed in #2). The language currently has two subsets with very different syntax. In essence, the programmer needs to learn two languages that only share the type system. While imperative code can be embedded in a rule, relational operations cannot be used inside functions. This often complicates refactoring, e.g., when code written as a function must be broken up if it needs to operate over multiple relations in the new implementation. As a separate annoying quirk, variables in imperative code are mutable, but variables in rules aren't. Can we unify the two subsets of the language?

  • Relation transformers (addressed by the design proposed in #2). The ability to write code parameterized by input/output relations would enable new types of modularity and code reuse. This is related to the previous issue, as in a language that unifies functions and rules such transformers could be just functions that take relations as arguments.

  • DDlog lacks interface inheritance (or traits) (see #5). Without this, our support for generics remains very limited. Interfaces will become more important if we want to support first-class relations and relation transformers, as a transformer will typically need to know something about the type of the input relation. Of course, this is not a minor feature and will bring a lot of complexity along with it.

  • Cleaner syntax & semantics for streaming extensions. We currently have -1 and differentiation operators, and we support joins and group_by over streams, which behave differently from regular join and group_by. These features are not easy to use to build stream processing applications. We may need additional low-level features (e.g., built-in support for sliding windows), as well as syntactic sugar to build high-level operators on top.

  • Support for incremental group-by. Groups are not modified incrementally; they are atomic fat objects. This is a prerequisite for incremental aggregation.

  • Support for incremental aggregates. Some aggregates can be computed incrementally. We might need separate syntax for incremental aggregation (unified syntax would of course be nice, but I'm not sure it's possible).

  • More powerful module system: re-exports, selective exports and imports (addressed in #4 and #10).

  • Add support for disjunction in rules.

  • Simplify the development of bindings for Rust libraries; simplify writing of extern functions (see #10):

    • Rust code lives in isolated files, which complicates the use of Rust tooling designed for code in crates (see #10).
    • Exporting Rust types to DDlog requires implementing numerous traits (FromRecord, Serialize, flatbuf-related stuff, etc.).
    • Auto-generated crate tree makes it tricky to import Rust code outside of the local crate (see #10).
  • Automate interning operations. Calling intern() and ival() is burdensome. Consider injecting these calls automatically. The challenge is that there can exist multiple functionally correct ways to inject these calls with different performance impact.
    As a limited version of this, we could automate only the interning of strings, so that the programmer can write regular quoted string literals ("foo"), with the compiler converting them to interned strings (e.g., i"foo") where necessary.

  • Correct use of the Ref<> type is not obvious for most users. Ref<> is an important performance optimization that avoids copying large values between tables as well as in procedural code (e.g., it can be expensive to return a container by value from a large function). However, the need for Ref and where to best apply it is not at all obvious for most users. One alternative is to get the compiler to wrap values in Ref<> automatically using heuristics, thus completely hiding this type of smart pointer from the user.

  • .dl and .dat files use different syntax. There is a school of thought that they should be unified, especially as we are planning to add the interpreted mode, so the command language could just become part of the interactive interpreter CLI.

  • LINQ-like syntax (addressed by the design proposed in #2). There is no reason we could not have an alternative syntax based on functions applied to collections.
    Here is an example:

    relation R, S, T
    T :- S.map(|e| { (e.x + 2, e.y) })
          .filter(|e| { e.x > 5 })
          .join(R, |e| { e.x }, |r| { r.z }, |(e,r)| { (e.x, e.y, r.h) })
          .group_by(|t| { t.x + t.h })
          .map(|g| { g.map(|e| { e.x }).sum() }).
    

    One important benefit of LINQ is that transformers become just functions that happen to take relations. They aren't special in any way.

  • Souffle components have some nice ergonomic features for reusing computation, including generics. We should look at them more closely: https://souffle-lang.github.io/components

  • Variable introduction: it is not always clear syntactically which expressions introduce and bind a new variable and which ones use the value of a bound variable. Perhaps we could have a binding indicator, e.g., #v as a shortcut for var v which can be used anywhere, including in patterns.

Performance

  • DDlog generates inefficient imperative code mostly due to excessive cloning (e.g., return-by-value requires cloning). See #7.

  • Functional data types should (probably) be used for collections. There are a couple of reasons for this. First, values in relations are immutable, so a small change requires cloning and modifying a potentially large collection. Second, the imperative fragment has pass-by-value semantics, which may also require excessive cloning even with an optimized compiler. Functional data types allow cloning collections in O(1). We already have a functional HashSet. Perhaps, all DDlog collection types should be backed by functional data structures by default. See #7.

  • Delta queries for long join sequences. DD provides the primitives to implement delta queries and worst-case optimal joins to avoid maintaining intermediate arrangements in long sequences of joins. We had support for that in a branch that is now bit-rotted. I think this is a potentially valuable feature and we should consider supporting it.

  • Reliance on DD. This is both a positive and a negative. DD does a lot of things for us, but it is not easy to understand and/or debug.

  • Bottom-up evaluation. This is a problem for programs that instantiate potentially very large relations.

  • Not many optimizations. The compiler could probably do more to optimize the computation (e.g. inter-rule analysis, imperative code dataflow analysis, etc.)

  • Multi-core scalability is weak. Could this be improved for large deltas?

  • The implementation is unsafe, due to both weight overflow and timestamp overflow.

Dynamic DDlog

  • Calling Rust functions from interpreted code. We need to compile Rust libraries as standalone .so with an interpreter-friendly interface.

Compilation speed

  • Serde blows up executable size and compilation speed. Consider either implementing our own serialization infrastructure (probably not a good idea) or finding a way to use dynamic dispatch with serde that will improve compilation speed, possibly at the cost of a small performance hit (sketched below).
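To sketch the dynamic-dispatch idea: route all serialization through a single object-safe trait, so the serialization machinery is instantiated once rather than monomorphized per type. This is the pattern that the erased-serde crate implements for real serde; the hand-rolled trait below is an illustration, not serde's API.

// Illustration only: a single object-safe serialization interface.
trait DynSerialize {
    fn write_json(&self, out: &mut String);
}

impl DynSerialize for u64 {
    fn write_json(&self, out: &mut String) {
        out.push_str(&self.to_string());
    }
}

impl<T: DynSerialize> DynSerialize for Vec<T> {
    fn write_json(&self, out: &mut String) {
        out.push('[');
        for (i, x) in self.iter().enumerate() {
            if i > 0 { out.push(','); }
            x.write_json(out);
        }
        out.push(']');
    }
}

fn main() {
    // One monomorphic entry point serves every type behind a trait object,
    // instead of a generic function instantiated for each concrete type.
    let values: Vec<Box<dyn DynSerialize>> =
        vec![Box::new(1u64), Box::new(vec![2u64, 3])];
    let mut out = String::new();
    for v in &values {
        v.write_json(&mut out);
    }
    assert_eq!(out, "1[2,3]");
}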

New features

  • Persistence

Tooling

  • Language server
  • Debugger
  • Flatbufs are a pain in the back. Can we drop them?

Documentation

  • The lack of a precise specification of the semantics of the language.

  • The documentation structure is suboptimal. The tutorial is essentially a language reference: every little feature is documented there. There should be separate documents for teaching and specification.

[RFC] Modular compiler architecture

For now this is a stub for an RFC.

The DDlog 2.0 compiler has to be modular, to enable incremental compiler evolution and experimentation (e.g., adding syntax desugaring passes, extending the language with additional features - e.g., traits, adding new code generators, adding new optimizations).

The experience with the P4 compiler based on a visitor architecture and nano-passes has been very positive. That compiler is written in C++, but in a functional style; it uses a garbage collector and relies strongly on the C++ const keyword to enforce immutability. The P4 compiler uses a purely functional IR and tools that automatically generate the visitors for the IR. The IR itself is extensible: even external users can add new classes to the IR (e.g., classes used in some specific back-ends).

The compiler is literally a few hundred passes, some of which are repeated. Each pass consumes an IR and produces a new IR, which is copy-on-write modified from the previous one. There are no side-channels between passes where they can exchange information; if a pass needs to produce additional information for other passes (e.g., type information), that information is conveyed in a separate data structure that is passed alongside the IR representation.

The visitors are strongly typed and make it easy to write analyses that only modify some parts of the code. For example, strength reduction only applies to expressions, so you need to write no code that touches statements.

The visitor also allows access to the context of an IR node. IR nodes have no parent pointers, but the visitor recursively traverses all children of a node and provides access to the path taken from the root to reach the node.

Many of these features are present in the Haskell DDlog compiler too, but the visitor API is not specifically spelled out, and the use of side data structures is not explicit.

There are some downsides in the P4 implementation: the IR is a DAG that reuses nodes, and this is a source of some hard-to-track bugs (sometimes the meaning of an expression depends on the context; having the same subtree represent identical expressions in two different places makes it difficult to distinguish between instances).

In the P4 compiler the type information is not part of the IR; it is a separate data structure. Thus the P4 compiler reruns type inference after any IR tree modification. The benefit is that this is an automatic safety check that immediately catches changes to the IR which violate typing. Another benefit is that when you modify the IR you don't also need to modify the type information (to some degree), so the code is somewhat simpler. The downside is that a lot of time is spent in type checking.

If the compiler is written in Rust, I propose to start by defining the structure of the IR and of the visitors that will be used, and by spelling out these rules explicitly. The C++ P4 compiler dispatches methods based both on the dynamic IR node type and on the visitor type, using a standard double-dispatch pattern built on the single-dispatch mechanism of C++ virtual functions. We have to figure out how this should be done using traits. In the P4 compiler there are several visitor base classes: Inspector (IR is read-only, produces side data structures) and Transform (modifies the IR), and users subclass one of these. There are also predefined visitor combinators to embed a sequence of visitors, to repeat a sequence of visitors, to choose some visitors conditionally, etc.
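As a starting point for that discussion, here is a minimal sketch of what the Inspector/Transform split could look like in Rust, over a deliberately tiny hypothetical IR (names and structure are illustrative, not a committed design):

// Hypothetical toy IR.
#[derive(Clone)]
enum Expr {
    Lit(i64),
    Add(Box<Expr>, Box<Expr>),
    Mul(Box<Expr>, Box<Expr>),
}

// Rewriting pass: consumes an IR and produces a new one. A pass overrides
// only the cases it cares about and falls back to `walk_expr` for the rest.
trait Transform {
    fn transform_expr(&mut self, e: Expr) -> Expr {
        walk_expr(self, e)
    }
}

// Default traversal: rebuild the node, transforming all children.
fn walk_expr<T: Transform + ?Sized>(t: &mut T, e: Expr) -> Expr {
    match e {
        Expr::Lit(n) => Expr::Lit(n),
        Expr::Add(l, r) => Expr::Add(
            Box::new(t.transform_expr(*l)),
            Box::new(t.transform_expr(*r)),
        ),
        Expr::Mul(l, r) => Expr::Mul(
            Box::new(t.transform_expr(*l)),
            Box::new(t.transform_expr(*r)),
        ),
    }
}

// Strength reduction touches only multiplication; every other node kind
// goes through the default traversal untouched.
struct StrengthReduce;

impl Transform for StrengthReduce {
    fn transform_expr(&mut self, e: Expr) -> Expr {
        match e {
            Expr::Mul(l, r) => {
                let l = self.transform_expr(*l);
                if matches!(*r, Expr::Lit(2)) {
                    // x * 2  =>  x + x
                    Expr::Add(Box::new(l.clone()), Box::new(l))
                } else {
                    Expr::Mul(Box::new(l), Box::new(self.transform_expr(*r)))
                }
            }
            other => walk_expr(self, other),
        }
    }
}

An Inspector would be the read-only analogue (taking &Expr and producing side data structures), and visitor combinators could be regular functions or types composing values that implement these traits.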

Another very powerful feature of the P4 compiler is that most of the IR is convertible back to legal P4 that can be compiled and validated. This makes debugging by humans much easier and helps with compiler validation as well. Only the last few back-end passes, which generate code for a different target, produce a new kind of IR.

[RFC] DDlog 2.0 Syntax

The syntax is described in an EBNF flavor loosely based on W3C's EBNF.

Note: as of now this grammar is neither complete nor final; it's just what I've implemented. Many of the constraints are much looser than they seem like they should be; that's just because of the error recovery I've implemented in the parser: things accepted during parsing are not always valid, and errors are emitted for them.

Root = Item*

Item = FunctionDef | RelationDef | TypeDef

RelationDef = Attributes Modifiers keyword:RelKw name:RelName columns:RelCols
RelKw = 'relation' | 'multiset' | 'stream'
RelName = 'ident'
RelCols = '(' columns:RelCol* ')'
RelCol = Attributes binding:Pattern ':' ty:Type ','*

FunctionDef = Attributes Modifiers keyword:'function' name:FunctionName Generics args:FunctionArgs ret:FunctionReturn? body:Block
FunctionName = 'ident'
FunctionArgs = '(' args:FunctionArg* ')'
FunctionArg = Attributes binding:Pattern ':' ty:Type ','*
FunctionReturn = ':' return_ty:Type

TypeDef = Attributes Modifiers keyword:'typedef' name:TypeName '=' body:TypeBody
TypeName = 'ident'

TypeBody = RecordType | SumType
RecordType = constructor:RecordName ('{' fields:RecordField* '}')?
RecordName = 'ident'
RecordField = Attributes name:Pattern ':' ty:Type ','*
SumType = ('|' RecordType)*

Attributes = Attribute*
Attribute = '#[' AttrPair* ']'
AttrPair = 'ident' '=' Expr ','*

Modifiers = Modifier*
Modifier = 'input' | 'output' | 'extern'

Type = GenericType | TupleType | FunctionType

GenericType = 'ident' Generics?
Generics = '<' generics:GenericArg* '>'
GenericArg = Type ','*

TupleType = '(' elements:TupleTypeElem* ')'
TupleTypeElem = Type ','*

FunctionType = 'function' args:FunctionTypeArgs ret:FunctionReturnType?
FunctionTypeArgs = '(' args:FunctionTypeArg* ')'
FunctionTypeArg = Type ','*
FunctionReturnType = ':' Type

Block = '{' statements:Stmt* '}'
Stmt =
    ExprStmt
    | VarDecl
    | IfStmt

ExprStmt = Expr ';'
VarDecl = 'var' binding:Pattern '=' value:Expr ';'

IfStmt = IfBlock* ElseBlock?
IfBlock = leading_else:'else'? 'if' cond:Expr Block
ElseBlock = 'else' Block

Pattern = VarRef | TuplePattern
TuplePattern = '(' elements:TuplePatternElem* ')'
TuplePatternElem = Pattern ','*

Expr =
    Literal
    | VarRef
    | Assign
    | ParenExpr
    | BinExpr
    | IfStmt
    | RetExpr
    | UnaryExpr
    | Block

VarRef = 'ident'

Assign = binding:Pattern '=' value:Expr

ParenExpr = '(' inner:Expr ')'

// TODO: Floats
Literal = Bool | Number | String
Bool = True | False
Number = 'number'
String = 'string'

RetExpr = 'return' expr:Expr

UnaryExpr = op:UnaryOp expr:Expr
UnaryOp = Bang | Minus

BinExpr = lhs:Expr op:BinOp rhs:Expr
BinOp =
    Plus
    | Minus
    | Star
    | Slash 
    | Percent
    | RightRocket
    | Pipe
    | Caret
    | Ampersand
    | Shl
    | Shr
    | And
    | Or
    | EqEq
    | Neq
    | RAngle
    | RAngleEq
    | LAngle
    | LAngleEq

Or = 'or'
And = 'and'
True = 'true'
False = 'false'

Eq = '='
Bang = '!'
Pipe = '|'
Plus = '+'
Star = '*'
Neq = '!='
Shl = '<<'
Shr = '>>'
Caret = '^'
Comma = ','
Colon = ':'
Minus = '-'
Slash = '/'
EqEq = '=='
LAngle = '<'
RAngle = '>'
LBrack = '['
RBrack = ']'
LCurly = '{'
RCurly = '}'
LParen = '('
RParen = ')'
Percent = '%'
Ampersand = '&'
Semicolon = ';'
LAngleEq = '<='
RAngleEq = '>='
RightRocket = '=>'

[RFC] Global relations, input/output relations

We propose to allow let-bindings in the global name space. In particular, this will allow writing DDlog-1 style rules whose inputs and outputs have global visibility:

let R1: Relation<T1>  = ...;
let R2: Relation<T2> = ...;
let R3: Relation<T3>  = R1.filter(...).join(R2, ...);

Note that such declarations can be considered syntactic sugar for 0-arity functions.
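For example, the R3 binding above could desugar along these lines (illustrative only):

// The global binding...
let R3: Relation<T3> = R1.filter(...).join(R2, ...);

// ...is equivalent to a 0-arity function:
fn R3() -> Relation<T3> {
    R1.filter(...).join(R2, ...)
}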

Such relations can also be accessed from a relation transformer without explicitly passing them as an argument to the transformer:

let R1: Relation<T1>  = ...;
let R2: Relation<T2> = ...;

fn rule1() -> Relation<T3> {
    R1.filter(...).join(R2, ...)
}
fn rule2() -> Relation<T3> {
    R1.antijoin(R2, ...)
}

// A relation with two rules.
let R3: Relation<T3> = rule1().concat(rule2());

Global mutually recursive relations must be declared as a tuple:

let (R1, R2): (Relation<T1>, Relation<T2>) = fixed_point(|r1, r2| ...);

Finally, we can extend this feature to define input and output relations as a special kind of global variables, which seems more ergonomic than passing them as inputs and outputs to the main function:

// An input relation cannot be assigned to.
input R1: Relation<T1>;
output R2: Relation<T2> = .....;

[RFC] Expression compiler

Intro

This RFC focuses on efficient compilation of the DDlog-2 expression language
(i.e., the non-relational subset of the language). Our language design choices,
as well as the choice of the backends we want to support, affect the complexity
(and feasibility) of building an efficient compiler.

Here is the expression language design we've discussed so far:

  1. Syntactically, our expression language is a subset of Rust.

  2. Semantically, it is a pure functional language with pass-by-value semantics
    (unless we allow mut variables), i.e., copying a variable to another variable
    or field, or passing it as an argument to a function is semantically equivalent
    to creating a copy of its value.

  3. Our primary target is Rust, hence we cannot rely on a garbage collector for
    automatic memory management (all we have is (1) Rust's ownership model and (2)
    limited forms of GC that can be implemented as libraries, e.g., reference
    counters).

To the best of my knowledge, the combination of 2 and 3 is unique: functional
languages typically use full-fledged mark-and-sweep GC. The only language I
know that relies on reference counting for GC is Swift, and Swift is
pass-by-reference, not pass-by-value (to be more precise, Swift has struct's
that are not reference counted and are pass-by-value, and classes, which are
reference counted and are pass-by-reference). "Unique" is not a good thing in
this case, I'd rather rely on a well-understood technology, but our requirements
seem to dictate these choices.

So we need a compiler + runtime design that implements pass-by-value semantics,
supports automatic memory management, and avoids actual cloning as much as
possible. We discuss a few strawman designs below. A practical solution may
combine more than one of these designs.

Strawman designs

Design 1: wrap all objects in a refcount

The compiler wraps all data types, except primitive types, in a reference
counted pointer. If we want to support mut variables, we probably want to use
a reference counter with copy-on-write support, so that copying is O(1) even
if the value may get mutated later. In addition, the compiler can perform data
flow analysis to avoid incrementing/decrementing the counter when possible
(see the sketch at the end of this design).

Pros:

  • Minimizes cloning.

Cons:

  • Wastes memory and CPU to store and update a reference counter with each
    object.

  • Even small objects must be heap-allocated. This alone can kill the
    performance.
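For reference, Rust's standard library already provides this copy-on-write behavior: Rc::make_mut clones the inner value only when the reference count is greater than one. A minimal sketch:

use std::rc::Rc;

fn main() {
    // Copying is O(1): only the reference count is incremented.
    let x = Rc::new(vec![1, 2, 3]);
    let mut y = Rc::clone(&x);

    // Mutation through `make_mut` clones the vector only because `x` still
    // holds a reference; if `y` were the sole owner, no clone would happen.
    Rc::make_mut(&mut y).push(4);

    assert_eq!(*x, vec![1, 2, 3]);
    assert_eq!(*y, vec![1, 2, 3, 4]);
}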

Design 2: Let the user decide

With this approach the user annotates each type to tell the compiler whether
they want it wrapped in a reference counter or not. In Swift, for example, the
class keyword generates a reference-counted type, while structs and tuples are
not reference counted.

Pros:

  • The user can make optimal decisions based on domain-specific knowledge that
    is not available to the compiler.

Cons:

  • In practice, most users won't know how to decide.

Design 3: Let the compiler decide

The compiler can decide which types to wrap in a reference counter, e.g., based
on their size.

Pros:

  • Improved performance compared to Design 1.

Cons:

  • The compiler doesn't have enough information to make an optimal decision.
  • The API between native Rust and DDlog code becomes unstable, as the compiler
    can arbitrarily wrap types in an Arc.

Design 4: Keep reference counters inside libraries

A copy-on-write reference counter is really just one example of a functional
data structure. Functional languages typically come with libraries that
implement many others, e.g., functional maps, sets, heaps, lists.

In this design the compiler does not inject (and is not even aware of) reference
counters. User-defined types translate directly into equivalent Rust structs,
enums, or tuples. The compiler uses a combination of pass-by-reference
(borrowing), ownership transfer, and cloning to maintain the pass-by-value
semantics. Library data types, especially container types, internally use
functional data structures to reduce the amortized cost of cloning and
modification operations (see the sketch after this design's pros and cons).

Pros:

  • Simpler compiler design.
  • Generated Rust types are near-identical to DDlog declarations, so no surprises
    for people implementing extern functions.

Cons:

  • User-defined types, e.g., structs, are not reference counted and will end up
    getting cloned when the compiler needs to copy the value (i.e., it cannot
    simulate pass-by-value by borrowing or ownership transfer). This can be
    expensive, especially for structs with many fields.
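As one possible illustration of this design, a container backed by a functional data structure supports O(1) clones via structural sharing. Here the im crate's persistent HashMap stands in for such a library type (the specific crate is an assumption, not a decision):

// Sketch of Design 4: containers internally backed by functional
// data structures; `im::HashMap` is used here as one example.
use im::HashMap;

fn main() {
    let m1: HashMap<String, u64> = HashMap::new();

    // `update` returns a new map that shares structure with the old one.
    let m2 = m1.update("x".to_string(), 1);

    // Cloning is O(1): both handles share the same internal tree.
    let m3 = m2.clone();

    assert!(m1.is_empty()); // the original map is unchanged
    assert_eq!(m2.get("x"), Some(&1));
    assert_eq!(m3, m2);
}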

Compiler design

All alternatives outlined above require an expression compiler that can maintain
pass-by-value semantics while minimizing actual cloning. In a nutshell,
an expression of the form let x = y can be compiled to Rust in several ways:

// Option 1: Cloning: always correct, but expensive.
let x = y.clone();

// Option 2: Ownership transfer: variable `y` is no longer used.
let x = y;

// Option 3: Borrowing: `y` outlives `x`; both variables are immutable during
// the lifetime of `x`.
let x = &y;

Option 1 is always valid, but can be expensive, whereas options 2 and 3 require
dataflow analysis to establish the lifetimes and access patterns of x and y.
In general, more complex program rewriting may be required to generate optimal
code, e.g., depending on the context, the following code

let x = y;
let z = y;

may get compiled to:

let x = y;
let z = &x;

In general, even ignoring the algorithmic complexity of optimization, it may
be impossible to statically determine the best implementation.

Optimizations cannot always be applied across function boundaries. For example,
passing input arguments by reference and returning results by value is a
reasonable default calling convention. However, there are cases when
passing by value or returning by reference makes more sense:

// Example: `f1` needs an owned copy of `x`, so `x` should be passed by value
// to avoid cloning.
fn f1(x: LargeType) -> SomeStruct {
    SomeStruct {
        x: x
    }
}

// Example: `f2` returns a field of its argument `s`, and can return
// by-reference to avoid cloning.
fn f2(s: SomeStruct) -> LargeType {
    s.x
}

In this example, the compiler has sufficient information to derive these
optimizations, but if functions f1 and f2 are extern or are declared
without an implementation as part of a trait, then the compiler must use
default calling conventions or rely on user annotations.

[RFC] RVSDG as an intermediate IR

Background:

As an intermediate representation we should use the RVSDG to represent both the timely-level dataflow graph and the inter-operator expressions (map functions, for instance). Using this unified representation to encompass both aspects of ddlog programs brings a number of benefits:

  • Intrinsic optimization: if we know both the contents of the dataflow graph and the contents of each function within it, we can perform both easy and complex dataflow (in the compiler sense, not the paradigm) optimizations such as relation.filter(|_| false) => empty_relation() and .map(|_| 10).filter(|x| x == 10) => .map(|_| 10)
  • Trivial dead code elimination on both the dataflow-graph and the expression level: this allows us to remove functions that are never called, relations that are never used, and close inputs that are never read from
  • Trivial de-duplication of the dataflow graph to allow for minimal re-computation (this extends to both collections and arrangements)
  • Modular compiler internals: since the RVSDG is very abstract, it is well decoupled from both the frontend and the final codegen target, allowing easy swapping of both, or even using the RVSDG as a target for entirely third-party languages or libraries

For those unfamiliar, the RVSDG is a way to represent programs as an abstract graph with only four distinct components (a rough Rust modeling follows the list below):

  • Nodes: atomic expressions within the program, such as 1, add 1, 2, or map
  • Regions: nodes that themselves contain sub-nodes. Regions are used for representing both control flow and effect flow; for example, the gamma node represents an if ... { ... } else { ... } and the theta node represents a loop { ... }
  • Value edges: edges between nodes that express data dependence, such as an add node having two incoming value edges from a 1 and a 2 node
  • Effect edges: edges that express state dependence, such as two print() calls being sequential relative to each other (an important note, as the RVSDG is inherently unstructured). Effect edges won't be used very much within the ddlog compiler, but they do have an important, albeit sparse, role to play in it
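A rough Rust modeling of these four components (illustrative only, not the actual IR definition):

// Hypothetical modeling of the four RVSDG components.
type NodeId = usize;

// A value edge expresses data dependence on output `port` of node `src`.
struct ValueEdge { src: NodeId, port: usize }

// An effect edge expresses state dependence on an effectful node.
struct EffectEdge { src: NodeId }

// A region is a sub-graph owned by a structured node.
struct Region { nodes: Vec<Node>, results: Vec<ValueEdge> }

enum Node {
    // Atomic operations: constants, `add`, dataflow operators like `map`.
    Simple {
        op: String,
        value_inputs: Vec<ValueEdge>,
        effect_inputs: Vec<EffectEdge>,
    },
    // Structured nodes contain regions: gamma = conditional, theta = loop.
    Gamma { predicate: ValueEdge, branches: Vec<Region> },
    Theta { body: Region },
}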

With these four primitives we can express programs in a very abstract and inherently normalizing way. That is to say, many different source language programs can (and will) look identical once translated into an RVSDG due to its inherent lack of ordering. Instead of add 1, 2; add 2, 3 and add 2, 3; add 1, 2 being seen as two separate programs, the RVSDG simply sees

[figure: add_one_two]

This (as mentioned before) also comes into effect when analyzing the dataflow graph as a whole since both the nodes of the dataflow graph and the expressions they contain can be unified, cutting the total program's potential runtime in half:

input relation R(x: usize)

output relation X(x: usize)
X(x) :- R(x), x + 5, x > 10.
X(x) :- R(x), x + 5, x < 4.

Ignoring the questionable nature of this program, the raw translation of this program into a graph would render this

[figure: raw_translation]

From this we can see that the map operation is trivially reducible: it is duplicated effort, and we can remove it

[figure: reduced_map]

That's a rather trivial example that could definitely be optimized further (say, by turning multiple filters on one relation that are then concatenated into a single filter), but it expresses the intent quite nicely.

Another optimization I mentioned was dead code elimination; with two simple rules we can do all the DCE on dataflow graphs that we need:

  1. Any node that cannot reach an output node is dead
  2. Any node that is not reachable by an input node is dead

[figures: no_input, no_output]
Both of these graphs would be considered dead code

So to wrap things up, I believe that the RVSDG is an optimal representation for compiling and optimizing our dataflow graphs and the expressions within them.

[RFC] AST interpreter.

On the surface, DDlog-2 is a purely functional language. The main function of a program transforms a set of input relations into a set of output relations. However, this transformation is just the first phase of program execution that runs as part of the compilation process. During the second, runtime, phase the user feeds data to the input relations and receives data from the output relations.

As part of the first phase, we must evaluate the program and generate a dataflow graph. This evaluation is performed by the AST interpreter.

Example

Consider the following program that takes relation r1 and returns a vector of 10 relations that multiply the contents of r1 by 1, 2, 3, ... 10.

fn main(r1: Relation<usize>) -> Vec<Relation<usize>> {
    let mut outputs = Vec::new();

    for i in 1..11 {
        outputs.push(f(r1, i))
    }
    outputs
}

fn f<T: Mul<usize>>(r: Relation<T>, multiplier: usize) -> Relation<T> {
    r.map(|x| x*multiplier)
}

The AST interpreter evaluates this program to produce the following dataflow graph:

                           ┌────────┐
             ┌────────────►│ Map(x1)├────────────────►
             │             │        │
             │             └────────┘
             │
             │
             │             ┌────────┐
    r1       │             │Map(x2) │
─────────────┼─────────────►        ├────────────────►
             │             └────────┘
             │
             │
             │               ...
             │
             │             ┌────────┐
             └────────────►│Map(x10)│
                           │        ├─────────────────►
                           └────────┘

The AST interpreter runs after parsing, validation, and type inference phases. Its output is not yet the dataflow graph used by the optimizer (RFC #8). It consists of high-level dataflow operators (i.e., no arrange, consolidate, etc.) and still uses the AST format for expressions inside each dataflow operator rather than RVSDG or whatever other format we choose for the IR.
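A rough sketch of the shape of that intermediate output (hypothetical; the real operator vocabulary and expression AST are not designed yet):

// High-level operators referencing each other by id, with expression
// bodies kept in AST form. `Ast` is a placeholder for the expression AST.
struct Ast;
type OpId = usize;

enum HighLevelOp {
    Input { relation: String },
    Map { input: OpId, f: Ast },
    Filter { input: OpId, predicate: Ast },
    Join { left: OpId, right: OpId, left_key: Ast, right_key: Ast, output: Ast },
    GroupBy { input: OpId, key: Ast },
    Output { input: OpId, relation: String },
}

struct Dataflow { ops: Vec<HighLevelOp> }

// The example above would evaluate to one Input node feeding ten Map nodes
// (Map(x1) .. Map(x10)), each of which feeds an Output node.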

[RFC] Embedding Datalog in a functional language

This RFC sketches a Datalog dialect implemented as syntactic sugar on top of a general-purpose functional language (e.g., Rust). It strives to combine the expressiveness and flexibility of a general-purpose language with the conciseness and clarity of Datalog.

LINQ and Differential Dataflow are two examples of general-purpose languages that support declarative queries over relational data. To this end, they implement all standard relational operators as higher-order functions (filter, map, join, group, etc.). This design has many important advantages:

  • Flattens the learning curve by relying on familiar programming constructs.

  • Unifies relational and imperative subsets of the language. Relations are first-class objects and can be passed around as regular values. In particular, relation transformers can be implemented as regular functions.

  • Takes advantage of existing language infrastructure: type system, expression syntax, parser, type checker, IDE integration and other tooling, documentation, etc.

The main downside of the approach is that it can be much more verbose than Datalog, as one must supply closures to each operator to unwrap the input record, bind its columns to local variables, and wrap the result of the operator back into a struct or tuple. Here is the same rule written in DDlog and in a Rust-like language:

relation R1(a: usize, b: usize)
relation R2(c: usize, d: usize)

// DDlog
R3(x,z) :- R1(x,y),
           R2(y,z).

// Rust
let R3 = R1.join(R2,
                 |r1| r1.b,  // extract join key from R1
                 |r2| r2.c,  // extract join key from R2
                 |r1, r2| R3{e: r1.a, f: r2.d}); // produce output record

Clearly, the second version is harder to understand and easier to get wrong.

We propose syntactic sugar on top of the functional syntax that eliminates this overhead while keeping the above advantages. The idea is that instead of writing closures explicitly, the programmer writes patterns that contain variable bindings, and the compiler automatically derives closures that produce and consume tuples containing the bound variables. For example, the above rule would be written as:

let R3(x,z) = R1(var x, var y).join(R2(y, var z));

which translates to the Rust code above. Note that this rule is syntactically close to Datalog.

In the rest of this RFC, we define a vocabulary of relational operators and their pattern-based versions. We leave the following questions outside the scope for now:

  1. The choice of the base language. While we can define the new language from scratch (similar to DDlog-1), it can also be designed as an extension of an existing language, e.g., Rust. There are all kinds of pros and cons to both options, and this choice will have a major impact on the structure of the project.

  2. Supported subset of the language. The exact subset of the language that can be compiled into efficient incremental code is limited by the capabilities of the compiler and the backend. For example, even though we can manipulate relations as values in the language, Differential Dataflow does not support this, so we may only be able to compile programs where all rules can be evaluated statically.

  3. Compiler architecture. Depending on the answer to question 1, the compiler will be built as an extension of an existing compiler (e.g., rustc), an embedded DSL (e.g., in GHC or C#), or a standalone compiler.

  4. Streaming features.

In the following sections we sketch the translation of various DDlog-1 elements to a pure functional language based on Rust (ok, it's really just sloppy Rust, where I ignore lifetimes and borrowing), and D-Rust, a hypothetical new language that adds syntactic shortcuts on top of Rust. These shortcuts are optional. The programmer may choose to write some or all rules using vanilla Rust.

Relation declarations

DDlog

input relation R1(a: usize, b: usize)
input relation R2(c: usize, d: usize)
output relation R3(e: usize, f: usize)

Rust

struct R1 {
    a: usize,
    b: usize
}
struct R2 {
    c: usize,
    d: usize
}
struct R3 {
    e: usize,
    f: usize
}

fn main(r1: Relation<R1>, r2: Relation<R2>): Relation<R3> {
    let r3: Relation<R3> = ... ;

    r3
}

Note that DDlog combines record type declaration, relation type declaration, and relation instance declaration in a single line, while Rust requires three separate declarations. Note also that Rust doesn't require input/output annotations: input relations are relations passed as arguments to the main function, output relations are returned by the function, and intermediate relations are local variables declared inside the function. This enables modularity and reuse, as an output of one function can be used as an input to the other, which amounts to relation transformer composition.

D-Rust

TODO: I am not sure how to define meaningful shorthand syntax for relation type and instance declarations.

Relational operators

We define relational operators as methods of type Relation<>.

impl<T> Relation<T> {
    fn join<T2, K, O>(
        self, other: Relation<T2>,
        k1: fn(T) -> K,
        k2: fn(T2) -> K,
        jfun: fn(T, T2) -> O) -> Relation<O>;

    fn antijoin<T2, K>(
        self, other: Relation<T2>,
        k1: fn(T) -> K,
        k2: fn(T2) -> K) -> Relation<T>;

    fn map<O>(self,
              f: fn(T) -> O) -> Relation<O>;

    fn filter(self,
              f: fn(T) -> bool) -> Relation<T>;

    fn group<K, V>(self,
                   f: fn(T) -> (K, V)) -> Relation<Group<K, V>>;

    fn flatten<T2, I>(self,
                      f: fn(T) -> I) -> Relation<T2>
    where
        I: IntoIterator<Item=T2>;

    fn union(self, other: Relation<T>) -> Relation<T>;

    // Fixpoint computation that evaluates a recursive fragment of the program:
    // starts with the initial contents of the recursive relations and applies `f`
    // until a fixpoint is reached.
    fn fixpoint(init: [Relation],
                f: fn([Relation]) -> [Relation]) -> [Relation];
}

join, antijoin

DDlog

R3(x,z) :- R1(x,y),
           R2(y,z).
R4(x) :- R1(x,y),
         not R2(y, _).

Rust

fn main(r1: Relation<R1>, r2: Relation<R2>): (Relation<R3>, Relation<R4>)
{
    let r3 = r1.join(r2, |r1| r1.b, |r2| r2.c, |r1, r2| R3{e: r1.a, f: r2.d});
    let r4 = r1.antijoin(r2, |r1| r1.b, |r2| r2.c)
               .map(|r1| R4{g: r1.a});

    (r3, r4)
}

D-Rust

fn main(r1: Relation, r2: Relation): (Relation, Relation)
{
    let r3 = r1(var x, var y).join(r2(y, var z)) -> R3(x, z);
    let r4 = r1(var x, var y).antijoin(r2(y)) -> R4(x);

    (r3, r4)
}

map, filter

DDlog

R3(x,z) :- R1(x,y),
           x > 0,
           w = x + y,
           R2(w,z).

Rust

let r3 = r1.
         filter(|r1| r1.a > 0).
         map(|r1| (r1.a, r1.b, r1.a + r1.b)).
         join(r2, |(x,y,w)| w, |r2| r2.c, |(x,y,w), r2| R3{e: x, f: r2.d});

D-Rust

let r3 = r1(var x, var y).
         filter(x>0).
         bind(var w = x+y).
         join(r2(w, var z)) -> R3{e: x, f: z};

Here I use bind as a more intuitive synonym for map.

group_by

DDlog

R2(y, m) :- R1(x,y),
            var m = x.group_by(y).min().

Rust

let r2 = r1.map(|r1| (r1.b, r1.a))
           .group()
           .map(|g| R2{c: g.key(), d: g.min()});

D-Rust

let r2 = r1(var x, var y).
         project(x).
         bind(var g = group_by(y)) -> R2{c: y, d: g.min()};

flatten

DDlog

relation R1(x: usize, ys: Vec<usize>)

R2(x,y) :- R1(x, ys),
           var y = FlatMap(ys).

Rust

let r2 = r1.map(|r1| r1.ys.map(|y| (r1.x, y))).flatten().map(|(x,y)| R2{x,y});

D-Rust

let r2 = r1(var x, var ys).bind(var y = ys.flatten()) -> R2{x, y};

Union, recursion

DDlog

Path(x, y) :- Edge(x,y).
Path(x, z) :- Path(x, w), Edge(w, z).

Rust

let path = Relation::fixpoint(
    [edge.map(|e| Path{e.from, e.to})],
    |[path]| [path.union(path.join(edge, |p| p.to, |e| e.from, |p, e| Path{p.from, e.to}))]);

D-Rust

We may want the compiler to detect recursive components automatically (Datalog-style). Although there is also a lot to be said for forcing the programmer to be explicit about recursion.

// initial assignment
let path = edge(var x, var y) -> Path{x, y};
// recursive step
path += path(var x, var y).join(edge(y, var z)) -> Path{x, z};
