Comments (8)
If I understood correctly there is a use case for using catalog abstractions/implementation but without datafusion core?
In my mind the real use case is to more easily use datafusion without having to bring in all the dependencies of ListingTable (like parquet, avro, json, etc)
So the real use case is getting ListingTable out of the core. But since the catalog API is in the core now, there is no way to get ListingTable out of the core without also first moving the catalog API
Is it due to the complexity of ListingTable that it has its own crate? If they have things in common, it is better to organize them into one crate. If ListingTable is so different from the others, it is nice to have an independent crate
I think it is both the complexity of ListingTable and its dependency tree (e.g. parquet-rs and avro and json and object_store and ...)
For use cases like WASM
it is quite messy to have the API split up like it currently is
from arrow-datafusion.
If this seems like a reasonable idea to people I will file tickets to break down the work
cc @andygrove @jayzhan211 @comphead @mustafasrepo for your thoughts
from arrow-datafusion.
I agree with this direction. But now this seems hard to achieve because SchemaProvider depends on TableProvider and TableProvider depends on SessionState.
from arrow-datafusion.
Thanks @alamb for starting this discussion.
If I understood correctly there is a use case for using catalog abstractions/implementation but without datafusion core?
As @lewiszlw correctly mentioned, we have some coupling between the providers and the core. I'm just trying to understand the use case where providers are needed without the core
from arrow-datafusion.
Is it due to the complexity of ListingTable that it has its own crate? If they have things in common, it is better to organize them into one crate. If ListingTable is so different from the others, it is nice to have an independent crate
from arrow-datafusion.
I agree with this direction. But now this seems hard to achieve because SchemaProvider depends on TableProvider and TableProvider depends on SessionState.
I agree @lewiszlw -- well put. I made a first PR to start detangling things here: #10794 (it just splits SessionState into its own module)
Longer term we would have to figure out where SessionState would live (it still depends on several things in the core crate like datasource::provider and datasource::function) 🤔
Maybe we could look into splitting out datafusion-datasource / datafusion-datasource-parquet / datafusion-datasource-avro, etc -- I don't have time to drive this at the moment but would be interested in helping anyone who did
from arrow-datafusion.
If I understood correctly there is a use case for using catalog abstractions/implementation but without datafusion core?
In my mind the real use case is to more easily use datafusion without having to bring in all the dependencies of ListingTable (like parquet, avro, json, etc)
So the real use case is getting ListingTable out of the core. But since the catalog API is in the core now, there is no way to get ListingTable out of the core without also first moving the catalog API
Hm... they probably strive to have their own readers/writers, perhaps other than the arrow-rs implementation; that makes sense to me. And yes, if DF stands for extensibility we should make this happen. Not sure how difficult that can be though. We probably need to start with replacing core abstractions with traits instead of implementations to decouple it.
from arrow-datafusion.
Hm... they probably strive to have their own readers/writers, perhaps other than the arrow-rs implementation; that makes sense to me. And yes, if DF stands for extensibility we should make this happen. Not sure how difficult that can be though. We probably need to start with replacing core abstractions with traits instead of implementations to decouple it.
Yes something like this -- I think most of the traits already exist (e.g. CatalogProvider) but figuring out how to decouple SessionState (which is referred to by CatalogProvider) is the trickiest bit, I think
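To make the "traits instead of implementations" idea concrete, here is a minimal sketch of one way the dependency could be inverted. It is not DataFusion's actual API: the `Session` trait, the simplified `scan` signature, and the `InMemoryTable` / `MemorySchema` types are hypothetical stand-ins made up for illustration. The point is only that the provider traits refer to a small trait rather than to the concrete SessionState, so they could live in a crate that does not depend on the core.

```rust
use std::collections::HashMap;
use std::sync::Arc;

// --- Would live in a lightweight catalog crate (hypothetical layout) ---

/// Hypothetical minimal view of a session: only what providers need.
pub trait Session {
    fn config_option(&self, key: &str) -> Option<String>;
}

/// Simplified stand-in for TableProvider: it sees `&dyn Session`,
/// not the concrete SessionState.
pub trait TableProvider {
    fn scan(&self, session: &dyn Session) -> String;
}

/// Simplified stand-in for SchemaProvider.
pub trait SchemaProvider {
    fn table(&self, name: &str) -> Option<Arc<dyn TableProvider>>;
}

// --- Would stay in the core crate ---

/// Stand-in for the concrete session state; it implements the small trait.
pub struct SessionState {
    options: HashMap<String, String>,
}

impl Session for SessionState {
    fn config_option(&self, key: &str) -> Option<String> {
        self.options.get(key).cloned()
    }
}

// --- Could live in a separate data source crate ---

/// Hypothetical table implementation that depends only on the traits above.
pub struct InMemoryTable {
    name: String,
}

impl TableProvider for InMemoryTable {
    fn scan(&self, session: &dyn Session) -> String {
        let batch_size = session
            .config_option("batch_size")
            .unwrap_or_else(|| "default".to_string());
        format!("scan {} with batch_size={}", self.name, batch_size)
    }
}

/// Hypothetical schema implementation holding tables by name.
pub struct MemorySchema {
    tables: HashMap<String, Arc<dyn TableProvider>>,
}

impl SchemaProvider for MemorySchema {
    fn table(&self, name: &str) -> Option<Arc<dyn TableProvider>> {
        self.tables.get(name).cloned()
    }
}

/// Wires the pieces together and returns the description of the scan.
pub fn demo() -> String {
    let mut options = HashMap::new();
    options.insert("batch_size".to_string(), "8192".to_string());
    let state = SessionState { options };

    let mut tables: HashMap<String, Arc<dyn TableProvider>> = HashMap::new();
    tables.insert(
        "t".to_string(),
        Arc::new(InMemoryTable { name: "t".to_string() }),
    );
    let schema = MemorySchema { tables };

    let table = schema.table("t").expect("table was registered above");
    table.scan(&state)
}

fn main() {
    println!("{}", demo());
}
```

In this sketch only the `impl Session for SessionState` block needs to know about the concrete state, so the three groups of code could be split across crates without a dependency cycle. How much of the real SessionState surface such a trait would have to expose is exactly the open question raised above.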
from arrow-datafusion.