
Comments (8)

alamb commented on July 1, 2024

> If I understood correctly there is a use case for using catalog abstractions/implementation but without datafusion core?

In my mind the real use case is to more easily use datafusion without having to bring in all the dependencies of ListingTable (like parquet, avro, json, etc.)

So the real use case is getting ListingTable out of the core. But since the catalog API is in the core now, there is no way to get ListingTable out of the core without first moving the catalog API

> Is it due to the complexity of ListingTable that it has its own crate? If they have common things then it is better to organize them into one crate. If ListingTable is so different from the others, it is nice to have an independent crate

I think it is both the complexity of ListingTable and its dependency tree (e.g. parquet-rs and avro and json and object_store and ...)

For use cases like WASM it is quite messy to have the API split up like it currently is

from arrow-datafusion.

alamb commented on July 1, 2024

If this seems like a reasonable idea to people I will file tickets to break down the work

cc @andygrove @jayzhan211 @comphead @mustafasrepo for your thoughts

lewiszlw commented on July 1, 2024

I agree with this direction. But now this seems hard to achieve because SchemaProvider depends on TableProvider and TableProvider depends on SessionState.
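
The dependency chain described here can be pictured with a small, self-contained sketch. Every type below is a simplified stand-in written for illustration, not DataFusion's real definition (the real traits are async and have many more methods), and `MemTable`, `MemSchema`, and `demo_coupled_scan` are hypothetical names.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Stand-in for the concrete session type that lives in the core crate.
pub struct SessionState {
    pub batch_size: usize,
}

/// Stand-in for the table abstraction. Because `scan` names the concrete
/// `SessionState`, this trait cannot move to a crate that does not depend
/// on the crate defining `SessionState`.
pub trait TableProvider {
    fn scan(&self, state: &SessionState) -> String;
}

/// Stand-in for the schema abstraction. It hands out `TableProvider`s, so it
/// picks up the dependency on `SessionState` transitively.
pub trait SchemaProvider {
    fn table(&self, name: &str) -> Option<Arc<dyn TableProvider>>;
}

/// Hypothetical in-memory table used only for this illustration.
pub struct MemTable {
    pub rows: usize,
}

impl TableProvider for MemTable {
    fn scan(&self, state: &SessionState) -> String {
        format!("scan {} rows in batches of {}", self.rows, state.batch_size)
    }
}

/// Hypothetical in-memory schema used only for this illustration.
pub struct MemSchema {
    pub tables: HashMap<String, Arc<dyn TableProvider>>,
}

impl SchemaProvider for MemSchema {
    fn table(&self, name: &str) -> Option<Arc<dyn TableProvider>> {
        self.tables.get(name).cloned()
    }
}

/// Walks the chain: schema -> table -> scan with the concrete session type.
pub fn demo_coupled_scan() -> String {
    let table: Arc<dyn TableProvider> = Arc::new(MemTable { rows: 3 });
    let mut tables = HashMap::new();
    tables.insert("t".to_string(), table);
    let schema = MemSchema { tables };
    let state = SessionState { batch_size: 8192 };
    let provider = schema.table("t").expect("table `t` was registered above");
    provider.scan(&state)
}
```

In this sketch, anything that implements `SchemaProvider` has to be compiled against whatever crate defines `SessionState`, which is the coupling the comment points at.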

comphead commented on July 1, 2024

Thanks @alamb for starting this discussion.
If I understood correctly there is a use case for using catalog abstractions/implementation but without datafusion core?

Like @lewiszlw correctly mentioned, we have some coupling between providers and the core. I'm just trying to understand the use case where providers are needed without the core

jayzhan211 commented on July 1, 2024

Is it due to the complexity of ListingTable that it has its own crate? If they have common things then it is better to organize them into one crate. If ListingTable is so different from the others, it is nice to have an independent crate

alamb commented on July 1, 2024

> I agree with this direction. But now this seems hard to achieve because SchemaProvider depends on TableProvider and TableProvider depends on SessionState.

I agree @lewiszlw -- well put. I made a first PR to start detangling things here: #10794 (it just splits SessionState into its own module)

Longer term we would have to figure out where SessionState would live (it still depends on several things in the core crate like datasource::provider and datasource::function) 🤔

Maybe we could look into splitting out datafusion-datasource / datafusion-datasource-parquet / datafusion-datasource-avro, etc -- I don't have time to drive this at the moment but would be interested in helping anyone who did

comphead commented on July 1, 2024

> If I understood correctly there is a use case for using catalog abstractions/implementation but without datafusion core?

> In my mind the real use case is to more easily use datafusion without having to bring in all the dependencies of ListingTable (like parquet, avro, json, etc.)

> So the real use case is getting ListingTable out of the core. But since the catalog API is in the core now, there is no way to get ListingTable out of the core without first moving the catalog API

Hm... they probably strive to have their own readers/writers, perhaps other than the arrow-rs implementation; that makes sense to me. And yes, if DF stands for extensibility we should make this happen. Not sure how difficult that can be though. We probably need to start with replacing core abstractions with traits instead of implementations to decouple it.

alamb commented on July 1, 2024

> Hm... they probably strive to have their own readers/writers, perhaps other than the arrow-rs implementation; that makes sense to me. And yes, if DF stands for extensibility we should make this happen. Not sure how difficult that can be though. We probably need to start with replacing core abstractions with traits instead of implementations to decouple it.

Yes, something like this -- I think most of the traits already exist (e.g. CatalogProvider), but figuring out how to decouple SessionState (which is referred to by CatalogProvider) is the trickiest bit, I think
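
One way to picture the "traits instead of implementations" idea is the sketch below. It is only an illustration under stated assumptions: the `Session` trait, its `batch_size` method, `ConstTable`, `SingleTableCatalog`, and `demo_decoupled_scan` are hypothetical names, and `TableProvider` / `CatalogProvider` are heavily simplified stand-ins rather than the real DataFusion traits.

```rust
use std::sync::Arc;

// ---- would live in a lightweight catalog crate ----

/// Hypothetical minimal interface exposing only what providers need
/// from a session.
pub trait Session {
    fn batch_size(&self) -> usize;
}

/// Simplified stand-in: the provider sees `&dyn Session`, not a concrete type.
pub trait TableProvider {
    fn scan(&self, session: &dyn Session) -> String;
}

/// Simplified stand-in: hands out tables without naming any core type.
pub trait CatalogProvider {
    fn table(&self, name: &str) -> Option<Arc<dyn TableProvider>>;
}

// ---- would stay in the core crate ----

/// Stand-in for the concrete session type; it implements the small trait.
pub struct SessionState {
    pub batch_size: usize,
}

impl Session for SessionState {
    fn batch_size(&self) -> usize {
        self.batch_size
    }
}

// ---- could live in a third-party crate that never depends on core ----

/// Hypothetical table that only reports how it would be scanned.
pub struct ConstTable;

impl TableProvider for ConstTable {
    fn scan(&self, session: &dyn Session) -> String {
        format!("scan via trait, batch size {}", session.batch_size())
    }
}

/// Hypothetical catalog holding exactly one named table.
pub struct SingleTableCatalog {
    pub name: String,
    pub table: Arc<dyn TableProvider>,
}

impl CatalogProvider for SingleTableCatalog {
    fn table(&self, name: &str) -> Option<Arc<dyn TableProvider>> {
        if name == self.name {
            Some(Arc::clone(&self.table))
        } else {
            None
        }
    }
}

/// Core passes its concrete `SessionState` where `&dyn Session` is expected.
pub fn demo_decoupled_scan() -> String {
    let catalog = SingleTableCatalog {
        name: "t".to_string(),
        table: Arc::new(ConstTable),
    };
    let state = SessionState { batch_size: 4096 };
    let provider = catalog.table("t").expect("table `t` exists in this catalog");
    provider.scan(&state)
}
```

The design choice in this sketch is that the dependency arrow is reversed: the catalog-side traits no longer mention `SessionState`, and the core crate implements the small `Session` trait instead. Whether such a trait can expose everything real providers need from the session is the open question raised in the comment.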
