harvardpl / formulog Goto Github PK

View Code? Open in Web Editor NEW

144.0 13.0 8.0 2.91 MB

Datalog with support for SMT queries and first-order functional programming

Home Page: https://harvardpl.github.io/formulog/

License: Apache License 2.0

ANTLR 0.51% Java 91.20% C++ 7.92% Vim Script 0.14% Dockerfile 0.07% CMake 0.17%

datalog logic-programming smt static-analysis symbolic-execution

formulog's People

Contributors

Stargazers

Watchers

Forkers

hessammehr mgree ufwt srsb2018 ktrianta bulksecuritygeneratorprojectv2 aaronbembenek shingarov

formulog's Issues

Interpreter bug: incorrect behavior around SMT variables

This program causes the interpreter to behave incorrectly:

rel bar(i32)

rel foo(i32 smt)

rel baz

foo(`#x[?]`).

baz :-
  bar(X),
  foo(`X`),
  X + 1 = 42.

bar(0).

(* Make sure `foo` is recursive in `ok` stratum, so that when `ok` rule is run,
   `foo` predicate is reordered to be first *) 
foo(`0`) :- baz.

foo(`#x0[?]`).
foo(`#x1[?]`).
foo(`#x2[?]`).
foo(`#x3[?]`).
foo(`#x4[?]`).
foo(`#x5[?]`).
foo(`#x6[?]`).
foo(`#x7[?]`).
foo(`#x8[?]`).
foo(`#x9[?]`).
foo(`#xa[?]`).
foo(`#xb[?]`).
foo(`#xc[?]`).
foo(`#xd[?]`).
foo(`#xe[?]`).
foo(`#xf[?]`).

foo(`41`).
bar(41).

fun len(xs: 'a list): i32 =
  match xs with
  | [] => 0
  | _ :: t => 1 + len(t)
  end

rel ok

ok :-
  baz,
  len(foo(??)) = 19.

Run in semi-naive evaluation mode, it derives only 18 foo facts. Run in eager evaluation mode, it crashes in the baz rule trying to add an SMT variable and a primitive integer.

The problem is that the interpreter currently assigns numeric IDs to solver variables independently of IDs to other terms, so that two different terms (a solver variable and something else) can have the same ID. This breaks assumptions made elsewhere in the interpreter.

Incorrect Result?

Hi guys,

Consider the following program:

input a(i32).
input d(i32).
@bottomup
output b(i32).
@topdown
output c(i32, i32).
@topdown
output ppp(i32).
@topdown
output q(i32).
a(0). a(1). a(2).
d(0). d(2). d(4).
ppp(X) :- a(X).
b(X) :- d(X).
c(X, Y) :- b(X), ppp(X), b(Y).
q(X) :- c(X, X).
output ok.
ok :- q(2).

output e__( i32, i32)
output f__(i32, i32)
output g__(i32, i32)   (* ------------ Output relation ------------ *)
output h__(i32)
output i__(string, i32)

e__(Var_b, Var_a) :- c(Var_a, Var_b).
f__(Var_d, Var_c) :- e__(_, Var_d), a(Var_c), !b(Var_c), ppp(_), e__(_, _).
f__(Var_b, Var_b) :- e__(Var_a, Var_b), !e__(Var_a, Var_b).
g__(Var_a, Var_c) :- d(Var_a), f__(_, Var_c).
h__(Var_g) :- f__(Var_g, Var_f), f__(Var_g, _), f__(Var_g, Var_g), q(_), h__(Var_f).

I get the following result for the relation g__

g__(0, 1)
g__(2, 1)
g__(4, 1)

If however I add a new rule f__(Var_a, Var_b) :- f__(Var_a, Var_b)., the result for g__ should not change. But if I run the following new program:

input a(i32).
input d(i32).
@bottomup
output b(i32).
@topdown
output c(i32, i32).
@topdown
output ppp(i32).
@topdown
output q(i32).
a(0). a(1). a(2).
d(0). d(2). d(4).
ppp(X) :- a(X).
b(X) :- d(X).
c(X, Y) :- b(X), ppp(X), b(Y).
q(X) :- c(X, X).
output ok.
ok :- q(2).

output e__( i32, i32)
output f__(i32, i32)
output g__(i32, i32)   (* ------------ Output relation ------------ *)
output h__(i32)
output i__(string, i32)

e__(Var_b, Var_a) :- c(Var_a, Var_b).
f__(Var_d, Var_c) :- e__(_, Var_d), a(Var_c), !b(Var_c), ppp(_), e__(_, _).
f__(Var_b, Var_b) :- e__(Var_a, Var_b), !e__(Var_a, Var_b).
f__(Var_a, Var_b) :- f__(Var_a, Var_b).  (* ------------ New Rule ------------ *)
g__(Var_a, Var_c) :- d(Var_a), f__(_, Var_c).
h__(Var_g) :- f__(Var_g, Var_f), f__(Var_g, _), f__(Var_g, Var_g), q(_), h__(Var_f).

I get:

g__(0, 1)
g__(0, 2)
g__(2, 1)
g__(2, 2)
g__(4, 1)
g__(4, 2)

Formulog version I am using: 508c768
Operating System: Debian GNU/Linux 10 (buster)
Kernel: Linux 5.4.163.1.amd64-smp

Convert user-relevant system properties to proper flags

Right now, Formulog is parameterized by a mix of system properties (like -DdebugSmt) and proper flags (like --dump-idb). For the sake of consistency and usability, we should convert user-relevant system properties to flags.

CSV reference

In 01-language-basics, in the section "External input relations", TSV should be defined, and the last paragraph contains a reference to "CSV" which should be "TSV". You may also want to emphasize somehow that the example document has tabs and not spaces.

Incorrect result?

Can `int smt` types be used

The 04_logical_formulas document section "Integers" describes a set of solver operations on types of int smt, but 01_language_basics does not describe int as one of the 5 built-in primitive types. Is there an int primitive type, or are these somehow auto-converted to an i32 or i64?

Remove built-in functions `substitute` and `is_free`

The built-in functions substitute and is_free have not proven to be useful in the case studies, and can be written in Formulog directly (i.e., don't require special runtime support), so we should consider removing them.

Generated code hangs during child process creation

Running with many threads, some threads hang in a call to read that happens during the constructor for boost::child::process. Apparently there are a couple issues related to pipes and creating child processes in a multithreaded setting:

Interpreter bug: destructor assumes term is a constructor

This program causes the Formulog interpreter to crash:

rel foo(bool smt)

foo(`true`).

rel not_ok

not_ok :- foo(`#x[i32] #= _`).

rel ok

ok :- !not_ok.

The code for a destructor assumes that its term is a Constructor object, whereas here it is a primitive (bool).

How should I represent user-defined types that are SMT bit-vectors

I'd like to create user-defined types to represent IP addresses. Specifically, a bv[32] to represent an IPv4 address and a bv[128] to represent an IPv6 address. Since i32 is interchangeable with bv[32] I'm currently just doing:

type ipv4address = i32

And then I can parse an IPv4 address from a string to a 32-bit integer:

fun parseIPv4Address(address: string) : ipv4address option = ...

rel ipv4addresses(string, ipv4address option)
ipv4addresses("11.2.3.4", parseIPv4Address("11.2.3.4")).

And when I run with --dump-all I do see this relation:

ipv4addresses("11.2.3.4", some(184681220))

Where 184681220 is the decimal representation of that IPv4 address.

However, this won't work for IPv6 addresses since there is no corresponding native 128-bit type. Additionally, I'd like to extend this to represent CIDR ranges / subnetworks. For example, the CIDR 192.168.0.0/16 represents the range of IP addresses from 192.168.0.0 to 192.168.255.255. I'd like to represent this as an SMT bit vector where the first 16-bits are fixed (has the values from the first two octets, 192 and 168) and the last 16-bits are free. This representation makes it very easy to check if a (concrete) IP address is in the range of a CIDR. And I'll be using IP addresses and CIDRs to build out models of firewalls and routing tables.

So my questions are:

How can I define a user-type that is just an arbitrary-length SMT bit-vector?
Once that's done, I'll need functions that parse IP address (or CIDR) strings to their binary representations. I'll also need functions that can check if a particular IP address is within the CIDR range. Can these just functions written in ML+SMT? Or do I need something else to manipulate these bit-vectors?

Add support for character literals

We should support character literals like 'a'. At a minimum, we could restrict support to the front end, where they could be converted to i32s.

Document SMT strategies

We support different ways of interacting with the back-end SMT solver(s). These include strategies that reset the solver state between queries, as well as different approaches to caching queries to take advantage of incremental SMT solving. We should document these.

Read in options from configuration files

As a convenience, allows users to specify options in a configuration file, instead of requiring them to list the options on the command line.

Compiler generates invalid query plans

The Formulog-to-Souffle transpiler generates invalid query plans in the presence of the CODEGEN_PROJECT relation, which it seems to treat as being in the same stratum as the predicate being defined.

$ cat foo.flg
rel a(i32)
rel b(i32, i32)
b(3, 42).
a(0).
a(1).
a(X + 2) :- a(X), a(X + 1), !b(X + 2, _).
$ java -Doptimize=5 -jar target/formulog-0.7.0-SNAPSHOT-jar-with-dependencies.jar foo.flg -c
$ cd codegen
$ cmake -B build -S .
...
$ cmake --build build
-- Configuring done
-- Generating done
-- Build files have been written to: /root/codegen/build
[  6%] Generating formulog.cpp
Error: Invalid execution order in plan (expected 2 atoms, not 3) in file formulog.dl at line 30
.plan 0: (1,2,3), 1: (2,1,3)
---------^-------------------
Error: Invalid execution order in plan (expected 2 atoms, not 3) in file formulog.dl at line 30
.plan 0: (1,2,3), 1: (2,1,3)
---------------------^-------
2 errors generated, evaluation aborted
gmake[2]: *** [CMakeFiles/flg.dir/build.make:74: formulog.cpp] Error 1
gmake[1]: *** [CMakeFiles/Makefile2:82: CMakeFiles/flg.dir/all] Error 2
gmake: *** [Makefile:91: all] Error 2

The offending rule in the generated Souffle is

a_(@expr_5(X)) :-
        a_(X),
        a_(@expr_4(X)),
        !CODEGEN_PROJECT_b_10(@expr_5(X)).
.plan 0: (1,2,3), 1: (2,1,3)

Add a Formulog tutorial

We could use a nice introduction to the language. Optimally, it would build up to an interesting example Formulog program (such as a symbolic evaluator for a simple input language).

CLI should load version from JAR

Right now, the Formulog version is hard-coded into the CLI, which is easy to forget to update (see #25). Instead, the version should be stored in the JAR (and maybe the git commit as well), and then we can load it programmatically from the CLI code. See:

Codegen: binary terminated by signal 13

Sometimes the binary we generate exits with code 141, signaling that it has been terminated by signal 13 (trying to write to a closed pipe). In particular, this occurs in the interp-N benchmarks for the symbolic executor. This seems to have been introduced by explicitly terminating Z3 child processes (#27).

Tests failing on Mac?

I just cloned formulog and tried to run mvn package, and I see some test failures. Here is some of the output:

Running edu.harvard.seas.pl.formulog.eval.SemiNaiveEvaluationTest
line 9:8 token recognition error at: '").\n'
line 11:0 no viable alternative at input 'foo("\\"\noutput'
Tests run: 166, Failures: 22, Errors: 0, Skipped: 0, Time elapsed: 5.341 sec <<< FAILURE!
test099(edu.harvard.seas.pl.formulog.eval.SemiNaiveEvaluationTest)  Time elapsed: 0.055 sec  <<< FAILURE!
java.lang.AssertionError: Unexpected exception:
edu.harvard.seas.pl.formulog.eval.EvaluationException: Exception raised while evaluating the rule:
ok :-
	none = get_model([_to_formula_normal_form(#x211), _to_formula_normal_form((~ #x211))], none),
	get_model([_to_formula_normal_form(#x211), _to_formula_normal_form((~ #x254))], none) -> some($102),
	query_model(#x211, $102) = some(true),
	query_model(#x254, $102) = some(false),
	query_model(#x289, $102) = none.

Exception raised while evaluating the literal: get_model([_to_formula_normal_form(#x211), _to_formula_normal_form((~ #x254))], none) -> some($102)

edu.harvard.seas.pl.formulog.eval.EvaluationException: Problem with solver 300:
Problem parsing solver output: Tried to consume "model", but found "(".
	at edu.harvard.seas.pl.formulog.eval.RoundBasedStratumEvaluator$RulePrefixEvaluator.doTask(RoundBasedStratumEvaluator.java:365)
	at edu.harvard.seas.pl.formulog.util.AbstractFJPTask.compute(AbstractFJPTask.java:40)
	at java.util.concurrent.RecursiveAction.exec(RecursiveAction.java:189)
	at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
	at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
	at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
	at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:175)

I'm on macOS 10.14, and my Z3 version is 4.8.12. I saw the same failure with Java 8 and Java 11. If my config is off or I am missing a dep, please let me know. Thanks!

String elimination/projection functions needed

Currently there are 4 string manipulation functions (plus to_string), but inspection is only supported on the prefix or via regexp. There are no functions to extract from the string (presumably these would be i32 char values?) or convert to another form such as a list, so the utility of strings is significantly constrained.

Suggestions would be string_sub : string -> 'a -> 'b -> string where 'a and 'b are integer forms to extract a sub-string given an offset and a length, string_char : string -> i32 -> i32 to extract a single character (from which string_sub could be synthesized), or string_to_list : string -> i32 list.

Type-check error trying to cons a string to a string list, I'm probably cons-ing wrong

As part of trying to parse a string IPv4 address into its binary representation, I've written a string split utility function that takes an input string haystack and the character to split on needle and returns a list of strings. I started with:

fun stringSplit(haystack: string, needle: string) : string list =
  let haystackChars = string_to_list(haystack) in
  let needleChar :: _ = string_to_list(needle) in
  stringSplit_(haystackChars, needleChar, [], [])

And originally I wrote the helper function stringSplit_:

(* this one doesn't type check *)
fun stringSplit_(haystackChars: i32 list, needleChar: i32, currentGroup: i32 list, sofar: string list) : string list =
  match haystackChars with
  | [] => sofar
  | current :: nil =>
    let lastGroup = cons(currentGroup, current) in
    let lastString = list_to_string(lastGroup) in
    stringSplit_([], needleChar, currentGroup, cons(sofar, lastString))
  | current :: rest =>
    if current = needleChar
    then
      let part = list_to_string(currentGroup) in
      stringSplit_(rest, needleChar, [], cons(sofar, part))
    else
      stringSplit_(rest, needleChar, cons(currentGroup, current), sofar)
  end

But I'm getting a type-check error:

Exception: Problem type checking function: stringSplit_
Type error in function: stringSplit_
match haystackChars with
	| [] => sofar
	| [current] => match (currentGroup :: current) with
	| lastGroup => match list_to_string(lastGroup) with
	| lastString => stringSplit_([], needleChar, currentGroup, (sofar :: lastString))
end
end
	| (current :: rest) => match beq(current, needleChar) with
	| true => match list_to_string(currentGroup) with
	| part => stringSplit_(rest, needleChar, [], (sofar :: part))
end
	| false => stringSplit_(rest, needleChar, (currentGroup :: current), sofar)
end
end
Cannot unify string and string list list
Problematic term: lastString

Which is surprising to me. If I reverse the order of arguments in my calls to cons I type check just fine but the lists are in the reverse order. stringSplit("11.2.3.4", ".") outputs ["4", "3", "2", "11"]. I can use the rev function from one of the included examples:

fun rev(Xs: 'a list) : 'a list =
  let fun cons_wrapper(Xs: 'a list, X: 'a) : 'a list = X :: Xs in
  fold[cons_wrapper]([], Xs)

And then reverse the lists where appropriate:

(* this one works *)
fun stringSplit_(haystackChars: i32 list, needleChar: i32, currentGroup: i32 list, sofar: string list) : string list =
  match haystackChars with
  | [] => rev(sofar)
  | current :: nil =>
    let lastGroup = cons(current, currentGroup) in
    let lastString = list_to_string(lastGroup) in
    stringSplit_([], needleChar, currentGroup, cons(lastString, sofar))
  | current :: rest =>
    if current = needleChar
    then
      let part = list_to_string(rev(currentGroup)) in
      stringSplit_(rest, needleChar, [], cons(part, sofar))
    else
      stringSplit_(rest, needleChar, cons(current, currentGroup), sofar)
  end

My questions:

My mental model for cons was that it was adding an item to the end of the list. I'm guessing it actually pre-pends the item. Is that correct?
Is there a way to construct this list so I don't have to reverse it later? I suppose it's not much of an actual concern but it feels like its a bit messy and maybe I'm doing some unnecessary work.
I feel like I've got some repeated code in the various pattern matching blocks in stringSplit_. This pattern matching took a while to figure out and I'm not thrilled with my solution of having both a current :: rest block and a current :: nil block. Is there a better (more concise, more idiomatic) way to do this pattern matching block?
Stepping back a bit, it would be nice to import or use a standard library where this function is already implemented instead of writing my own. What flavor of ML does formulog support? And can I import or otherwise re-use some existing ML standard library code somehow?

Consider adding SMT framing

Using is_sat in the body of a rule doesn't use any framing, so even if the solver has discharged the sat determination, it remains in memory within the solver and can't be discarded even if it will no longer be used. When the rule is being used to process a large number of intensional facts, this can cause significant memory consumption overhead in the solver.

For example:

    @external
    input values(i32)

    output solve(i32, i32, i32, i32)
    solve(X, Y, Z, A)  :- 
        values(X), values(Y), values(Z),
        X != Y, X != Z, Y != Z,
        2020 = X + Y + Z,
        A = X * Y * Z.

This trivial program generates results for all triples that sum to 2020. For a values.tsv file with 1000 entries, this executes in Formulog on the order of 9 seconds on my system.

Changing the line

    2020 = X + Y + Z,

    is_sat(`2020 #= bv_add(bv_add(X, Y), Z)`),

can generate millions of SMT terms. While each of these is trivially sat/unsat by Z3, I would expect this to run for some lengthy period of time and produce the same answer as the Datalog-only version. Instead, this runs for tens of minutes with the Z3 process slowly accumulating more memory consumption before dying from memory exhaustion on my 32GB machine. Using SMT framing should allow previous results to be garbage collected and avoid this memory exhaustion.

List operations

From examples/liquid_types.flg, there appears to be an automatic (builtin?) type of list, with pattern matching on [] and :: operations. These are undocumented as far as I can tell.

Additionally, :: appears to be an infix constructor, but there's no discussion of infix operators elsewhere (that I saw, maybe I missed it); is :: unique or is there a language syntax to declare infix operators or constructors and define precedence and affinity?

What are the semantics of ::? Is X :: Y :: ZList a valid pattern to match the first two elements of a list?