egraphs-good / egg Goto Github PK
View Code? Open in Web Editor NEWegg is a flexible, high-performance e-graph library
Home Page: https://egraphs-good.github.io
License: MIT License
egg is a flexible, high-performance e-graph library
Home Page: https://egraphs-good.github.io
License: MIT License
Right now this checked the string for a leading "?". It would be nice if users could use it for cheap string interning.
egg should panic when the user duplicates rule names in rewrite!
.
Herbie has an optimization that tracks the rules applied for each class, so that we can skip those rules if the class doesn't change. This could speed up searching quite a bit when each class is big. This could be a simple annotation once we have class metadata.
Right now the pattern search machine isn't reused between pattern searches, so it reallocates a lot.
I can fix this with some unsafe, but doing it safely will require an API change.
I created a copy of the math.rs
test that uses a custom language based on a struct:
#[derive(Clone, Debug, Hash, PartialEq, Eq, PartialOrd, Ord)]
enum MathOp {
// 2 args
Diff,
Integral,
Add,
Sub,
Mul,
Div,
Pow,
// 1 args
Ln,
Sqrt,
Sin,
Cos,
// 0 args
Constant(Constant),
Symbol(Symbol),
}
#[derive(Clone, Debug, Hash, PartialEq, Eq, PartialOrd, Ord)]
pub struct Math {
op: MathOp,
children: Vec<Id>,
}
Running the benchmarks shows this to take 15-30% more cycles for each test:
Test | math | math_struct | math / math_struct | 1-(math / math_struct) |
---|---|---|---|---|
diff_power_harder | 21,074,499 | 30,392,699 | 69.34% | 30.66% |
diff_power_simple | 5,047,100 | 7,156,238 | 70.53% | 29.47% |
integ_one | 1,329,639 | 1,533,784 | 86.69% | 13.31% |
integ_part1 | 8,189,299 | 12,439,484 | 65.83% | 34.17% |
integ_part2 | 30,383,957 | 48,027,206 | 63.26% | 36.74% |
integ_part3 | 3,705,544 | 5,267,235 | 70.35% | 29.65% |
integ_sin | 1,340,420 | 1,547,473 | 86.62% | 13.38% |
integ_x | 1,401,729 | 1,639,932 | 85.47% | 14.53% |
associate_adds | 60,466,286 | 80,325,970 | 75.28% | 24.72% |
diff_different | 1,337,645 | 1,543,119 | 86.68% | 13.32% |
diff_ln | 1,350,054 | 1,556,474 | 86.74% | 13.26% |
diff_same | 1,321,303 | 1,498,483 | 88.18% | 11.82% |
diff_simple1 | 3,352,771 | 4,654,982 | 72.03% | 27.97% |
diff_simple2 | 3,018,403 | 4,165,698 | 72.46% | 27.54% |
powers | 1,440,284 | 1,731,722 | 83.17% | 16.83% |
simplify_add | 2,648,652 | 3,418,225 | 77.49% | 22.51% |
simplify_const | 1,746,411 | 2,237,483 | 78.05% | 21.95% |
simplify_factor | 8,295,182 | 11,505,095 | 72.10% | 27.90% |
simplify_root | 11,303,834 | 15,622,872 | 72.35% | 27.65% |
I suspect the problem is the use of discriminant
, which causes all nodes to be considered the same, preventing short-circuiting of some searches and possibly inability to use lookup classes by operator, etc.
As a first step, I can send a PR with the test / benchmark. I think @mwillsey you had mentioned some planned work that would address this, so perhaps this issue can just serve as a tracking issue for that work?
This would be helpful (and avoid surprises) in cases where egg is being used as part of a larger project that already has a language defined.
This maybe more of an RFC than an issue, but right now, union will not complain if two eclasses A and B are merged where both A and B contain distinct constants.
Some hypothetical clients may want support for such situations, but an eclass containing distinct constants will usually be a bug in a rewrite rule or other equivalence finder running over the egraph.
A straightforward fix would be to extend the range type of the eclass's hash map (currently Vec<ENode>
) in the egraph type to also track what constant (if any) an eclass contains. Then union can quickly check for compatibility during merging.
More interestingly, this would be a first step towards supporting client-specific equivalence finders, constant folding in particular. I'm not sure what that interface looks like yet, but getting it right will be an important design challenge.
If one of the timeout checks fails, the runner will not report the final iteration.
It's common to have rewrite rules to apply both ways, so it'd be convenient to have a bidirectional rewrite! macro. Such a macro (say equiv!
) can generate a vector of 2 rules, and the user can use it like so:
vec![
equiv!("assoc"; "(* a (* b c))" = "(* (* a b) c)"),
equiv!(...),
...
].concat()
This will also force the user to separate bidirectional rules from one-way rules, which is good.
There is an initial rebuild at the beginning of run
in run.rs
that is there to create the initial classes_by_op
hash table in the egraph. The time for this rebuild (usually very small) is not tracked by the iterations vector in the runner.
Here's a link: http://taktoa.me/eqsat/Thesis.pdf
Unfortunately, my implementation didn't get very far. UIUC's senior thesis credit grades based on what you write, not what you implement.
It took me a while to realize that to get the EClass of a given Id you need to do egraph[id]
.
The new Analysis::merge
could use a bit more doc, along the line of:
Analysis::Data
must form a partial order. For example, the "constant" analysis for i32
forms the order None < Some(x: i32)
for any x
(but any pair of Some(x), Some(y)
are not comparable). merge
modifies to
to be the least upper bound of to
and from
. For example, take merge(a, b)
:
b = None
there is no need to modify a
, as b
contains no information (is the bottom of the partial order). We know for sure b <= a
.a = None
and b = Some(2)
modifies a
to be Some(2)
, as b > a
.a = Some(1)
and b = Some(1)
there is no modification.a = Some(1)
and b = Some(2)
egg should panic, as the points have no LUB (they give conflicting information for what the constant should be).Is the above consistent with the implementation? I also can't figure out what the ordering returned by merge
is supposed to mean from the tests.
Add functionality in test_fn
to specify that set of rules do not rewrite term A
to term B
. Additionally, use the delta debugging crate to discover a minimal set of rules that do rewrite A
to B
to make debugging easier.
One thought: this should either not be turned on by default OR only run once per run of a test suite. This can take a while to run, especially on large rule sets. A single unsoundness may trigger many tests to fail, but the result of rerunning DD will usually not give much more information. I still think it would be good to complete running the entire test suite, but maybe having a DD_HAS_RUN flag somewhere and checking it before running.
It seems the depth
argument in union_depth
is never used but for debugging, and it is always 0. Should it be removed?
I think there are some nits in the lambda test. For example, Analysis::make
inserts the ID of variables to free
, but lambda
takes a symbol and CaptureAvoid::apply_one
tests if that symbol is in free
, which will always be false. To see this, changing lambda_capture_free
to the following will fail:
egg::test_fn! {
#[should_panic(expected = "Could not prove goal 0")]
lambda_capture_free, rules(),
"(let y (+ (var x) (var x)) (lam (var x) (var y)))" => "(lam (var x) (+ (var x) (var x)))"
}
Similarly, the test cases also pass symbols to let
and fix
, which Analysis::make
would add to free
.
Furthermore, in my implementation I consistently use variables instead of symbols, but implementing the free variable test as in CaptureAvoid
does not work. The ID of v2
never matches any ID stored in free
(any clue why this happens? Do I have to call find
?) Instead, I had to test if the free var of v2
is in the free vars of e
.
This will help avoid some silly bugs.
Hello! I'm running into interesting behavior when I create a Runner with a very short timeout (say 200 microseconds) and a large number of rules.
It tells me Saturated
, and I extract the minimal expression, but if I increase the timeout (say to 400 microseconds), I again get Saturated
with a smaller minimal expression. Unless I'm misunderstanding saturation, this behavior would seem to mean that the initial Saturated
status was incorrect.
I think the source of this issue is the break
statements inside run_one
:
https://github.com/egraphs-good/egg/blob/master/src/run.rs#L423
https://github.com/egraphs-good/egg/blob/master/src/run.rs#L452
With a short timeout, these prevent the loop(s) from iterating through all the rules, meaning applied
may be empty, even though there could be further work to do. The empty applied
causes run_one
to give a Saturated
result.
As a temporary fix, does it make sense to just comment out those two break
s? This will allow the runner to exceed the limits, but only by one "iteration".
Maybe a better fix would be to change the two checks of the form
if self.check_limits().is_err()
break;
}
to something like
result = result.and(self.check_limits());
if result.is_err() {
break;
}
(where result
is initially Ok(())
), and then run_one
could return result
if its an Err
? This approach is what I'm doing right now, and it seems to be working.
Let me know if I'm completely missing something here. Otherwise, I can create a pull request for the above, if it would help!
This should panic instead
https://github.com/mwillsey/egg/blob/db793b0117dd25349ba483248c04ab90bb241536/src/macros.rs#L304
It's because here, each if
is used to create another ConditionalApplier
, meaning the last if
is the outermost ConditionalApplier
and is evaluated last.
A potential solution that I haven't had the time to test yet: turn the if
s into a list of some kind, then fold over the list in the reverse order.
Another potential solution: can you just reverse the order of matched things in a pattern that matches repetitions?
It seems like Analysis data should be accessible when implementing a CostFunction, but it's not!
The rust std library formatting traits (Display
, Debug
, Binary
, etc.) are all structured the same way: they have a single method
fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result
Language::display_op
should follow the same convention, either by changing its signature:
fn display_op(&self, f: &mut fmt::Formatter) -> fmt::Result
or by separating its functionality out into a new trait:
pub trait DisplayOp {
pub fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result;
}
Beyond just consistency, this change would also have the added advantage that the operator's string representation could be generated on-the-fly, whereas the current signature (returning &dyn Display
) requires the implementer to return a reference to some value that already exists. I'm happy to submit a pull request making this change.
The number of eclasses doesn't represent ways to compute the expression, but rather number of none-empty distinct subexpressions.
https://github.com/mwillsey/egg/blob/eed5d0a6407e536d0b14ccc81dfe2e57c0256742/tests/math.rs#L477
I found a degenerate case using rules similar to the math example. When confronted with an expression like (* x 0)
it seems to first apply both zero-mul
and comm-mul
to get that (* x 0) = (* 0 x) = 0
. After that, each iteration seems to run more and more instances of assoc-mul
. I think this makes arises since the x
is equivalent to a multiplication, leading to: (* x 0) = (* x (*x 0)) = (* x (* x 0))
.
The result is that this simple expression times out after 10 seconds. Is there a way to prevent this expansion? I wonder if it would be enough to add guards on the associativity rules that the nested expression isn't a constant?
If there is a way, I'm happy to figure out and add a reasonable case to the existing tests for it.
In some cases it would be useful to access the analysis data after extraction. For instance, the analysis data may include information about constraints or type information that are useful during rewriting (for instance, only apply certain rules to certain types of nodes). This information may also be useful for lower level code-generation from the extracted expression.
But, the extracted Ids are not related to the IDs in the graph, making the analysis information difficult to access.
Some possible options:
extract_with_analysis
) that collects the analysis information for each node added to the RecExpr
.I think the former would be pretty easy to add, if I'm reading things right:
/// Find the cheapest (lowest cost) represented `RecExpr` in the
/// given eclass.
pub fn find_best_analysis(&mut self, eclass: Id) -> (CF::Cost, RecExpr<L>, Vec<N::Data>) {
let mut expr = RecExpr::default();
let mut data = Vec::new();
let (_, cost) = self.find_best_analysis_rec(&mut expr, &mut data, eclass);
(cost, expr, data)
}
fn find_best_analysis_rec(&mut self, expr: &mut RecExpr<L>, data: &mut Vec<N::Data>, eclass_id: Id) -> (Id, CF::Cost) {
let eclass = self.egraph[eclass_id];
let (best_cost, best_node) = match self.costs.get(&eclass.id) {
Some(result) => result.clone(),
None => panic!("Failed to extract from eclass {}", eclass.id),
};
let node = best_node.map_children(|child| self.find_best_rec(expr, child).0);
data.push(eclass.data.clone());
(expr.add(node), best_cost)
}
It may be cleaner to create a wrapper around the data that allows indexing with Id
.
Did I miss a way to get the analysis data after extraction? If not, would the above method be generally useful? Another option would be to allow the recursive call to take a function that is applied to each eclass
. Then one version does nothing (and hopefully gets compiled away) while the other version pushes the data to the vector. This could also allow for versions that push just a part of the data, etc.
The main one I can think of is an applier that adds a new expression based on a matched pattern, but does not substitute it (i.e. doesn't union
). Instead it's just appended to the e graph as is. This could be used to create "deductions" that could then be used by conditional rewrite rules. Consider the use case of inferring properties of expressions based on the properties of their constituents. For example, a commutativity rewrite rule might be conditional on whether the operator in question is indeed commutative on the operands. The language might include set-theoretic terms like x is-a-member-of N
which can be checked for with a condition so that the term (x + y) is-a-member-of N
can be generated, and then the commutativity substitution (x + y) + x
to x + (x + y)
can be made.
I'm not sure how much this overlaps with the "analysis" feature, but it seems like it would be more powerful.
Mostly curious, from my own projects/converting between normal forms, In the the prop.rs example, Introducing BiImplication tends to increase the complexity of the ast, because assuming you add an enum variant like: "<->" = BiImplication([Id; 2]),
and a case to make Prop::BiImplication([a, b]) => Some(x(a)? || !x(b)?) && Some(x(b)? || !x(a)?),
,
you would need to either clone
, or Rc
, a
and b
.
So introducing BiImplication causes the ownership to get a bit hairier.
I thought it'd be interesting to see how that small change affects the otherwise clean example,
however I haven't tried it in egg yet.
Would be nice to have per-rule customization of scheduling. For example, apply BackOffScheduler only to certain rules.
The common flow of running should support extracting at a set of roots.
This will also require a Runner to be able to add stuff to the report.
I'm writing a very custom Language
implementation and I suspect that maybe there is a bug in my implementation that is causing this assertion in "egraph.rs" to fail. Question is, I'm not sure what sort of problem leads to "Found unexpected equivalence", so maybe the text of that assertion could be expanded to include details of what the likely cause is?
We should implement constant folding. The Language
trait should provide the functionality to do that.
After pruning, the egraph may not be connected.
We should be able to "garbage collect" unreachable eclasses given a set of eclass roots.
Sometimes you only want to rewrite if something is not equal.
This gets a little dicey because things may become equal later, but we should support it as it's pretty common.
I would like to use egg
in a compiler for a programming language, but the compiler is written in a language without FFI to Rust or C. I wrote a poor version of egg in the meantime, but it would be ideal to be able to call an egg
subprocess or similar.
I am struggling to implement the Analysis
trait correctly, especially the merge
function. I got very good at producing endless loops or failing asserts in the caller function. I have something working now for a very simple data type but I am not certain why it works. There are more elaborate examples in the paper and also some explanation, but I am still struggling. Also in the main branch the return value is now Option<Ordering>
but I am not sure what is happening exactly depending on the respective four different results.
Please consider to add a bit more information there what invariants need to hold in the merge operation and what is happening when this function is called.
I'm just dropping in my requested feedback in a place that's easy to find. I haven't moved on yet to using Egg, so this is just based on reading the docs:
I'll add other thoughts to this issue as I move on to playing around.
Noticed this when trying to use a generic language like:
struct MyLang {
op: OpKind,
children: Vec<Id>,
}
When used as a pattern, search
fails because it uses std::mem::discriminant(...)
to get the key of the enum, which is unspecified behavior.
This restriction should at least be documented, since it affects the set of valid languages.
One suggestion would be to add something to the Language
trait that allows the user to change the discriminant. In my case, the OpKind
could be used to identify operators.
For instance:
trait Language {
type OpDiscriminant: PartialEq + Eq + Hash + Debug;
fn discriminant(&self) -> OpDiscriminant
}
For languages that based on enums, this could be implemented as:
impl Language for EnumLang {
type OpDiscriminant = Discriminant<Self>;
fn discriminant(&self) -> OpDiscriminant { std::mem::dicriminant(self) }
}
For languages like mine that use a struct, this could be:
impl Language for MyLang {
type OpDiscriminant = OpKind;
fn discriminant(&self) -> OpDiscriminant { self.op }
I'm not sure if this affects other parts of the code. If not, I'd be happy to try creating a pull request to address this if the approach outlined above seems reasonable.
Right now the runner takes over, and prevents you from doing custom things with the egraph.
This method was present in the latest release but no longer exists in the development version. I don't see anything in the changelog about it. I have the same question about Language::children_mut()
.
I'm on Mac, where stack sizes are small, and the egraph I'm searching through is large...but still, I assume this is not expected behavior?
I'll post a link to a demo you can run yourself in a bit!
It's not recursive anymore.
There should be some way to classify rewrites as expansive or not, and those that are should have some kind of fuel applied to them.
Can we do this classification automatically?
Description
Currently the left hand side and the right hand side of a rewrite rule can only contain one subtree each, with the equivalent nodes being the two root nodes. It would be helpful to support more general cases where each side can contain multiple subtrees (potentially with shared nodes), and a rewrite rule can have multiple matched equivalent nodes.
Example cases:
Here split_0 (and split_1) refers to taking the first (and second) element from the split operation. matmul is matrix multiplication. There are two matched outputs: (matmul ?input_1 ?input_2) is equivalent to (split_0 (matmul ?input_1 (concat ?input_2 ?input_3))), and (matmul ?input_1, ?input_3) is equivalent to (split_1 (matmul ?input_1 (concat ?input_2 ?input_3))). They can't be specified as separate rules, since to apply (matmul ?input_1 ?input_2) => (split_0 (matmul ?input_1 (concat ?input_2 ?input_3))), we need to match ?input_3 in the egraph before as well (needs to also be matmul'ed with ?input_1).
This case differs from above in that left hand side contains two separate subtrees, where case 1's LHS share common nodes between (matmul ?input_1 ?input_2) and (matmul ?input_1, ?input_3). Both would be useful.
Some initial idea on how to achieve this
Maybe we can still use the RecExpr to represent the patterns on both sides, but adding a mapping of the matched output pairs (instead of the two last elements of LHS and RHS). And when matching LHS (and applying RHS), we can keep a record on which indices of the RecExpr that hasn't been matched (and applied), so that we make sure all elements are matched (and applied) before we finish.
The documentation suggests importing egg's interned String as egg::Symbol
, but this fails to build in a clean new project (created with cargo new
, egg v0.4.1
):
Compiling autocfg v1.0.0
Compiling log v0.4.8
Compiling cfg-if v0.1.10
Compiling symbolic_expressions v5.0.3
Compiling smallvec v1.4.0
Compiling once_cell v1.4.0
Compiling instant v0.1.4
Compiling indexmap v1.4.0
Compiling egg v0.4.1
Compiling egg-test v0.1.0 (/Users/avh/research/egg-test)
error[E0432]: unresolved import `egg::Symbol`
--> src/main.rs:1:5
|
1 | use egg::Symbol;
| ^^^^^^^^^^^ no `Symbol` in the root
When using dot
, a loop from a node to it's own class is displayed as being to the node, rather than the cluster.
It would help with debugging if merge
took two eclasses as arguments instead of their analysis data. Consider a situation where two eclasses incorrectly get merged due to some bug. The debug log currently shows us which rule firing led to this situation. It however does not tell us which two eclasses were merged. I think changing the signature of the merge
function will allow a user of egg to get this information and more easily debug their egg-based application.
A declarative, efficient, and flexible JavaScript library for building user interfaces.
๐ Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. ๐๐๐
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google โค๏ธ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.