Git Product home page Git Product logo

Comments (10)

alamb avatar alamb commented on August 17, 2024 1

Oh yeah, you are right, I don't know why I didn't think about that. I will try it anyway, maybe I will come up with a different solution. Thanks Peter, for letting me know!

In order to make it work, you might be able to implement a custom iterator like

enum InputIterator<'a> {
  SingleInput(&LogicalPlan),
  VecInput(&[LogicalPlan])),
..
}

And then implement the Iterator trait appropriately

from arrow-datafusion.

alamb avatar alamb commented on August 17, 2024 1

If you guys really think this can be helpful i can open up a PR so you can look at some details, but it is maybe worthwhile to just wait until Rust allows us to return multiple types in a -> impl Trait function.

I agree your (clearly very impressive skills) might be better spent on other projects.

Is there any particular issue or area you are interested in working on?

from arrow-datafusion.

LorrensP-2158466 avatar LorrensP-2158466 commented on August 17, 2024 1

This is very cool -- is the code somehere pubic?

I can make my implementation public AcyclicJoin Node. In short, it introduces a new logical node (acyclic join) that is coupled with a physical node (join impl of that acyclic join), but the physical node is part of that PhD project, which I can't share. To create those acyclic join nodes, I had to detect if a particular join tree (i.e. subsequent joins in a LogicalPlan) is acyclic or not.

I also think adding an example of SQL analysis would be nice. I'll move to #10871 so this can be closed.

from arrow-datafusion.

alamb avatar alamb commented on August 17, 2024 1

I can make my implementation public AcyclicJoin Node. In short, it introduces a new logical node (acyclic join) that is coupled with a physical node (join impl of that acyclic join), but the physical node is part of that PhD project, which I can't share. To create those acyclic join nodes, I had to detect if a particular join tree (i.e. subsequent joins in a LogicalPlan) is acyclic or not.

That is very cool -- thank you for sharing @LorrensP-2158466 . Since DataFusion is totally open it is always cool to see what people are doing with it.

(BTW if you need a paper about DataFusion to cite -- we now have one https://dl.acm.org/doi/10.1145/3626246.3653368 :) )

from arrow-datafusion.

alamb avatar alamb commented on August 17, 2024

I believe @peter-toth and @ozankabak and I discussed something similar in #10543 (comment)

The new reference walking APIs added in #10543 (which I don't think are released yet) may also be related -- specifically TreeNode::visit can now return references that can be kepy of the underlying nodes

from arrow-datafusion.

peter-toth avatar peter-toth commented on August 17, 2024

@LorrensP-2158466, I as far as I get you just want to change the return type of LogicalPlan::inputs() to use iterators instead of the current Vec. I think that is not really related to #10543 and actually can be a good idea to avoid unnecessary Vec allocations.

But I feel you will run into problems with the impl Iterator<Item = &LogicalPlan> return type as it doesn't allow returning different iterator implementations on different LogicalPlans. (And I think we want to avoid dyn Iterator too.)
But please give it a try if you have some time...

from arrow-datafusion.

LorrensP-2158466 avatar LorrensP-2158466 commented on August 17, 2024

Oh yeah, you are right, I don't know why I didn't think about that. I will try it anyway, maybe I will come up with a different solution. Thanks Peter, for letting me know!
I will also look at #10543 for extra info.

from arrow-datafusion.

LorrensP-2158466 avatar LorrensP-2158466 commented on August 17, 2024

So I have found a solution that i can compile, but at this stage I'm not very happy with it.
I will first try to explain my reasoning as to how i got to my current solution, it is a bit much, so if you are only interested the final you can Jump to the end. I did underestimate this problem...

Explanation

I first tried the solution Andrew mentioned, but it runs into following problems:

  1. The Join Plans and Recursive Query
    Both the Join and CrossJoin hold the left and right inputs separately in an Arc, so we need another variant Double() which needs to know if the first input is retrieved and than the second. This is because you can't transform both the Arc's into a &[LogicalPlan]. This also applies to Recursive Queries which have 2 inputs.
  2. The Union node holds a Vec<Arc<LogicalPlan>> which you also can't transform safely into a &[LogicalPlan].
  3. The Extension node doesn't allow us to know how the inputs are stored, we have to assume this can have multiple inputs all contained in Arc's, so we get the same problem as the Union Node but a little worse because some Extension nodes just have 1 input but we treat them as having multiple.

To support all these cases i came up with:

pub enum InputIterator<'parent> {
    Empty(Empty<&'parent LogicalPlan>),
    Single(Once<&'parent LogicalPlan>),
    Double(DoubleIter<&'parent LogicalPlan>), // DoubleIter<T> = std::array::IntoIter<T, 2>
    Multiple(std::vec::IntoIter<&'parent LogicalPlan>),
}

Built-In LogicalPlans

Most cases can be handled by the Empty, Single or Double variant.
The Multiple variant has a Vec::IntoIter, so we can handle the Union case easier, this is because we have to transform the Vec<Arc<LogicalPlan>> into a collection of references, which results in us allocating and transforming it back into a iterator. This also happens in the inputs()function but with just the collect() call.

Extension Plans

To handle the Extension nodes i added the following function into the UserDefinedLogicalNodetrait:

fn inputs_iter(&self) -> InputIterator<'_>;

The user of this trait can then choose their own iterator, and we only have to call extension.node.inputs_iter().

More Problems

But there are still some problems with this (i think):

  1. The Multiplevariant still causes us to allocate in a lot of cases, some UserDefined plans in examples have their inputs stored as Vec<LogicalPlan>, so we need to do inputs.iter().collect().into_iter().
    We could add another Iterator type:
Slice(slice::IntoIter<&'parent LogicalPlan>)

which is an slice iterator, so the above would be inputs.as_slice().into_iter(). But adding another variant that still says "i have multiple inputs" does not look nice.
2. Some Nodes, like Union hold their inputs in Vec<Arc<LogicalPlan>> which also causes us to transform and collect this into a Vec<&LogicalPlan> to turn this into a iterator the Multiple variant accepts.

Solutions

To fix this i can split up the Multiple variant into 2 new variants:

pub enum InputIterator{
    // ...
    Slice(slice::IntoIter<&LogicalPlan>),
    FromArcs(Map<Iter<'_, Arc<LogicalPlan>>, AsRef::as_ref>), // maps Arc<LogicalPlan> into &LogicalPlan
}

This does sum up to a total of 5 different iterator types, but i don't know how i can cover every possible way of holding onto multiple inputs. For example if node implementation holds their inputs in some other collection (like BTreeMap, HashMap, ...) their respective Iter implementations have different types, so if someone wants to return a InputIterator they are forced to also keep their inputs in a Vec<LogicalPlan> or Vec<Arc<LogicalPlan>>.

Currently

So currently the InputIterator looks like this...

pub enum InputIterator<'parent> {
    Empty(Empty<&'parent LogicalPlan>),
    Single(Once<&'parent LogicalPlan>),
    Double(DoubleIter<&'parent LogicalPlan>),
    Slice(SliceIter<'parent, LogicalPlan>),
    FromArcs(FromArcsIter<'parent, LogicalPlan>),
}

I have made a macro which let's me "delegate" a call to this iterator to any of the inner iterators, e.g. fn next():

fn next(&mut self) -> Option<Self::Item> {
    delegate!(self, iter => iter.next())
}

Things to do:

  • find a way to accept many more IteratorTypes, (maybe just have a fallback in the form of Vec::into_iter or dyn Iterator).
  • Minimize Iter variants while still keeping the implementation clear, maybe we can collapse the Empty, Single and Double into one variant, but this does cause more checks.

If you guys really think this can be helpful i can open up a PR so you can look at some details, but it is maybe worthwhile to just wait until Rust allows us to return multiple types in a -> impl Trait function.

from arrow-datafusion.

LorrensP-2158466 avatar LorrensP-2158466 commented on August 17, 2024

Thanks, that means a lot. I'm interested in anything data science or database related. I came in contact with DataFusion because of my Bachelor's project this year, and I liked it so much that I wanted to contribute in any way I could. But because of school, I can't find the time to do this actively. The project was about detecting α-acyclic joins to help out a PhD project at my University, so I had to use LogicalPlans and the logical optimizer, where I came up with this "issue." Now that it was done, I wanted to try it out.

As I said, I like to help wherever I can, but I'm not familiar enough with DataFusion to know where I can help. Maybe you guys know some places where I can look/help?

Thanks for the interest and help in this issue!

from arrow-datafusion.

alamb avatar alamb commented on August 17, 2024

The project was about detecting α-acyclic joins to help out a PhD project at my University, so I had to use LogicalPlans and the logical optimizer, where I came up with this "issue." Now that it was done, I wanted to try it out.

This is very cool -- is the code somehere pubic ? Hopefully doing this kind of analysis would be easier now with the nicer tree node API from @peter-toth .

One thing that might be interesting / relevant perhaps then would be to add an example of that kind of analysis.

For example, this ticket describes an example for an analyzer rule #10855, but writing an example of sql_analaysis.rs that shows how to parse some sql and then use DataFusion structures to do an analysis (like maybe join counts, or predicate analysis or something) would be really neat -- I filed #10871 to track this

from arrow-datafusion.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.