Git Product home page Git Product logo

Comments (14)

alamb avatar alamb commented on August 26, 2024 1

🤔

I believe the core reason for having TableSource is precisely to avoid the dependencies on SessionState (so it is possible to plan SQL without having the entire datafusion stack)

https://docs.rs/datafusion/latest/datafusion/logical_expr/trait.TableSource.html#method.supports_filter_pushdown

from arrow-datafusion.

cisaacson avatar cisaacson commented on August 26, 2024 1

@alamb I will see if I can try it out in the next few days.

from arrow-datafusion.

cisaacson avatar cisaacson commented on August 26, 2024 1

@alamb I fully agree, that PR will make what is needed feasible. This will be much easier to do once #11516 is merged, I'll wait until that is done and then work on this again.

from arrow-datafusion.

alamb avatar alamb commented on August 26, 2024

I think the usecase of passing some sort of state to TableProvider methods is a good and useful thing

https://docs.rs/datafusion/latest/datafusion/datasource/provider/trait.TableProvider.html

One way is to add a SessionState parameter as you suggest to all the methods. Given that it is already passed to scan, this seems a pretty reasonable change to me. While it would be an API change it would be very mechanical to add.

However, it seems like if we ever are going to break apart the core crate more, we'll have to figure out how to split out SessionContext. We have a related discussion here: #11182

Also related slack coversation in ASF slack: https://the-asf.slack.com/archives/C04RJ0C85UZ/p1719691754005389

from arrow-datafusion.

cisaacson avatar cisaacson commented on August 26, 2024

Thanks @alamb , makes very good sense. We can separate the 2 concerns. The only issue I see is that TableSource also references supports_filters_pushdown, that is where I got stuck trying to implement that. If we can figure out how to expose SessionState to TableSource this won't be hard to do at all. Let me know if I can help. If we implement this we should also make all args as SessionState (right now execute uses TaskContext, they should be the same).

from arrow-datafusion.

cisaacson avatar cisaacson commented on August 26, 2024

@alamb That's right, and therein lies the conundrum... We would need to solve that somehow, that's why I suggested the restructuring. If that were possible I would be submitting a PR to do this 😄

from arrow-datafusion.

alamb avatar alamb commented on August 26, 2024

When does the TableSource::supports_filter_pushdown get called I wonder 🤔 Is it possible just to use TableProvider

from arrow-datafusion.

cisaacson avatar cisaacson commented on August 26, 2024

@alamb Good point, I could thought of that but was not sure. I found this: datafusion/expr/src/logical_plan/plan.rs which calls TableSource.supports_filters_pushdown but that is only for Display of a LogicalPlan (certainly not critical). Further, the TableSource.supports_filters_pushdown is @deprecated, so maybe the best thing is to just remove all TableSource.supports_pushdown_filters altogether? That would be nice and much easier.

If you like that idea I can experiment in my local clone of the project.

from arrow-datafusion.

alamb avatar alamb commented on August 26, 2024

Further, the TableSource.supports_filters_pushdown is @deprecated, so maybe the best thing is to just remove all TableSource.supports_pushdown_filters altogether? That would be nice and much easier.

If it is deprecated, I think making a PR to remove the deprecated functions would certainly be a good step forward as it would keep things simpler

from arrow-datafusion.

cisaacson avatar cisaacson commented on August 26, 2024

@alamb That would be great. I do wonder though, the TableSource is used when the LogicalPlan invokes supports_filters_pushdown, correct? I don't know enough to say if that is required, or if it is only in a PhysicalPlan which is all a custom data source cares about.

from arrow-datafusion.

alamb avatar alamb commented on August 26, 2024

I am not sure either -- we would have to do some more research (potentially try it, for example) to find out

from arrow-datafusion.

cisaacson avatar cisaacson commented on August 26, 2024

@alamb I had some time and tried a very simple workaround, it would be up to you and the team if it is useful.

I added a single arg to: TableProvider.supports_filters_pushdown:

    fn supports_filters_pushdown(
        &self,
        task_id: &Option<String>,
        filters: &[&Expr],
    ) -> Result<Vec<TableProviderFilterPushDown>> {
        Ok(vec![
            TableProviderFilterPushDown::Unsupported;
            filters.len()
        ])
    }

This only touched 9 files, and given the task_id I can access the state I have set up. This was far simpler than adding SessionState, for my use case that wasn't necessary. So while this is different than other places, it does not add more dependencies on the current SessionState, which I think is a good thing. Note that this could have been TaskContext but even that is more than I need, the task_id alone does the trick.

This avoids TableSource altogether, but I may need it there too.

However, even this was easy to do the issue is that likely remains is who calls supports_filters_pushdown in a PhysicalPlan? It see that LogicalPlan calls it which puts us back to TableSource. Do you know the calling structure for this? If we need to add it to TableSource then the same issue remains, where to carry the TaskContext and the task_id so it is provided to TableProvider implementations (which requires the SessionState or SessionContext.

I believe we would need to add task_id here: LogicalPlan::TableScan(scan), which is the TableScan struct. If the task_id were supported wherever that was created it might work.

If you or someone on the team can help me solve this part of it then this is a very simple change. At least I understand the issue now.

Let me know what you think.

from arrow-datafusion.

cisaacson avatar cisaacson commented on August 26, 2024

After trying to modify LogicalPlan::TableScan with the extra field, it quickly balloons into needing to change a lot of files (optimizer, etc.). Since scan works it seems like supports_filters_pushdown should be feasible too, but I would need help to sort this out. Let's discuss on slack when you can.

from arrow-datafusion.

alamb avatar alamb commented on August 26, 2024

After trying to modify LogicalPlan::TableScan with the extra field, it quickly balloons into needing to change a lot of files (optimizer, etc.). Since scan works it seems like supports_filters_pushdown should be feasible too, but I would need help to sort this out. Let's discuss on slack when you can.

I think #11516 looks promising as a way to potentially pass SessionState to the supports_filters_pushdown API

from arrow-datafusion.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.