Comments (29)

josevalim commented on August 22, 2024 4

@rupurt note that I am working on an ADBC adapter for Explorer+Polars to cover the database connectivity bits (I am often livestreaming it on twitch.tv/josevalim).

from explorer.

matreyes commented on August 22, 2024 3

Hi @cigrainger, thanks for this amazing work!
IMHO, and just for the discussion:
Polars and DataFusion are similar. They both provide dataframe APIs on top of Apache Arrow (the data structure), so having both backends is reasonable.

The pure Elixir implementation means that the APIs would be implemented in Elixir, but the data structure could also be Apache Arrow (allowing interoperability), and it could be implemented from scratch or using the Rust crate.

For me, Ecto is different, as it handles persistence. I would like to have an Explorer.ecto_insert() or Explorer.from_ecto(), which would be similar to the to_csv and from_csv APIs, and which would be independent of the backend.

DuckDB also handles persistence, through SQL. Technically it is very similar to Polars, but the objective is to have a single-file database like SQLite. I think that DuckDB could be implemented as an Ecto repo, like exqlite / ecto_sqlite3, and interoperability could be handled by SQL/Ecto, through Parquet files or directly with Arrow (gRPC), but that is another project.

rupurt commented on August 22, 2024 2

@cigrainger big +1 to those data backends. There isn't currently an Ecto adapter for kdb+. If one isn't possible, I'd also throw that on the list of important backends.

rupurt commented on August 22, 2024 2

Are there any resources besides the current Polars backend that can help me get started on writing additional backends? I'm pretty familiar with DuckDB and would love to get started on writing one.

I'd also like to create an ODBC backend which would be useful for connecting to many other database engines.

josevalim commented on August 22, 2024 2

We did land ADBC support on main. :) We also added an API for loading data from an ArrowStream, which could perhaps be a mechanism to integrate Duck and Polars in the future. I think we can close this issue for now. The Lazy backend is covered, and we could explore others, but I don't think it is a priority given where the project is. :)

cigrainger commented on August 22, 2024 1

@kimjoaoun I'd like to make some progress on #54 first, which I've started as of last week. It should inform the other approaches.

cigrainger commented on August 22, 2024 1

Those would be very welcome! At a very high level the backend just needs to implement the defined behaviours. E.g. Explorer.Backend.DataFrame. And at first it doesn't need to implement the whole thing. Personally, I'd start at one or two IO functions, then start adding simple stuff from there (e.g. select).

Feel free to ping me (the EEF Slack is probably the best place) to discuss!
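
As a rough illustration of the "implement the defined behaviours, starting with one or two IO functions and then select" approach described above, here is a minimal sketch. Everything in it is an assumption made for illustration: `Sketch.Backend` is a hypothetical, heavily trimmed stand-in and not the real `Explorer.Backend.DataFrame` behaviour (whose callbacks and signatures differ), and `Sketch.MapBackend` uses a naive CSV parser with no quoting or type inference.

```elixir
defmodule Sketch.Backend do
  # Hypothetical, trimmed-down behaviour used only for this sketch.
  # The real Explorer.Backend.DataFrame defines many more callbacks.
  @callback from_csv(path :: String.t()) :: {:ok, term()} | {:error, term()}
  @callback select(df :: term(), columns :: [String.t()]) :: term()
end

defmodule Sketch.MapBackend do
  @behaviour Sketch.Backend

  # A "dataframe" here is a list of {column_name, values} pairs,
  # which keeps the column order from the file header.
  @impl true
  def from_csv(path) do
    # File.read/1 errors fall through `with` unchanged as {:error, reason}.
    with {:ok, contents} <- File.read(path) do
      # Naive parsing: assumes a non-empty file, "\n" line endings,
      # and no quoted fields. All values stay strings.
      [header | rows] =
        contents
        |> String.split("\n", trim: true)
        |> Enum.map(&String.split(&1, ","))

      columns =
        header
        |> Enum.with_index()
        |> Enum.map(fn {name, i} -> {name, Enum.map(rows, &Enum.at(&1, i))} end)

      {:ok, columns}
    end
  end

  # Keep only the requested columns, preserving their original order.
  @impl true
  def select(df, columns) do
    Enum.filter(df, fn {name, _values} -> name in columns end)
  end
end
```

With one IO callback and `select` in place, further callbacks can be added one at a time, with the compiler warning about any behaviour callbacks that are still missing.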

cigrainger commented on August 22, 2024 1

DuckDB supports ADBC as of 0.8.

josevalim commented on August 22, 2024 1

It can be a custom backend, if you wanna tackle it. The Lazy backend here is already capable of building a query, then you would need a translation layer to SQL (depending on the underlying SQL database).

cigrainger commented on August 22, 2024

Thanks @rupurt! I have very little experience of kdb+ except for a colleague raving about it. Could you flesh out the use case? Would you be interested in contributing?

emilioforrer commented on August 22, 2024

Hi @cigrainger, thanks for all the awesome work!

As for the DataFusion backend, you can check out this library, which already has Elixir bindings for Apache Arrow, Parquet and DataFusion.

elixir-arrow

It seems like it already implements the basics.

cigrainger commented on August 22, 2024

Thanks for that @emilioforrer! I actually used @treebee's fork of ex_polars when initially building Explorer! I didn't realise they were also working on this. I'll look into it.

kimjoaoun commented on August 22, 2024

What would we need to do to start working on an explorer_ecto backend? 👀
(btw, is it too early to start that?)

srowley commented on August 22, 2024

One thing to consider is the approach for testing; I worked on a pure Elixir backend for Series this weekend just to get familiar with it, and testing a different backend is a challenge given all the doctests (hard to just implement one thing at a time), a test that relies on the underlying RNG implementation here, a test that is tied to the DataFrame implementation there (I think, that one could just be me) and so on.

I don't pretend to know how to tackle that, but it would definitely be a thing to think through (or at least document if it is already more doable than I am making it out to be) as a precursor to developing additional backends.

Edit: It's not that hard to test one thing at a time with doctests; I'm sure I would never have found this if I hadn't complained publicly first.

cigrainger commented on August 22, 2024

Thanks @srowley, good point.

Just as a PSA I think we'll skip the pure Elixir backend because @philss has been working on precompilation, so the pain point we wanted to address should be moot.

josevalim commented on August 22, 2024

I also think it may be easier to go with Postgrex/MyXQL directly rather than Ecto. I have a hunch that if we use Ecto we will be mostly fighting against its DSL, and we really only need a small subset of what Ecto provides.

cigrainger commented on August 22, 2024

Hmmm.. my thinking was that the Ecto DSL was exactly what would make it more approachable. For example, a big chunk of dplyr is the way it builds composable queries and translates to SQL: https://dbplyr.tidyverse.org/articles/sql-translation.html#single-table-verbs. So in the same way that we can leverage the Polars lazy API and build up queries against in-memory dataframes, we can leverage the Ecto DSL and build up queries against the db. But I agree there will likely be some fights with the DSL. I just think I'd rather use it than reimplement parts.

josevalim commented on August 22, 2024

As an example, I think Ecto queries do not allow the field names to be strings. So at least this would need to change. Plus Ecto brings changesets, schemas, transactions, the necessity to define a repository per connection... and I think none of this is actually needed by Explorer? The only part that is really necessary is the AST to SQL layer, and that's the smallest problem Ecto (ecto_sql) solves. :) The other part that we would need is managing connections, queries, encoding/decoding, but this is done by the adapters.
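
To give a sense of the size of an "AST to SQL layer" with string column names, here is a minimal sketch. It is not how Explorer's Lazy backend or ecto_sql represent or compile queries: the `{:select, columns, table, filter}` plan shape, the `Sketch.SQL` module, and the supported operators are all hypothetical, and literals are inlined only to keep the example short (a real layer would emit query parameters and hand them to the database adapter).

```elixir
defmodule Sketch.SQL do
  # Hypothetical query plan: {:select, columns, table, filter}, where
  # `filter` is nil or a nested tuple such as {:>, {:col, "age"}, 30}.
  # Column and table names are plain strings.

  def to_sql({:select, columns, table, filter}) do
    cols = Enum.map_join(columns, ", ", &quote_ident/1)
    base = "SELECT #{cols} FROM #{quote_ident(table)}"

    case filter do
      nil -> base
      condition -> base <> " WHERE " <> compile_expr(condition)
    end
  end

  # Column reference.
  defp compile_expr({:col, name}), do: quote_ident(name)

  # Binary operators, compiled recursively and always parenthesised.
  defp compile_expr({op, left, right})
       when op in [:==, :!=, :>, :<, :>=, :<=, :and, :or] do
    "(#{compile_expr(left)} #{op_sql(op)} #{compile_expr(right)})"
  end

  # Literals are inlined for brevity only; prefer query parameters in practice.
  defp compile_expr(value) when is_number(value), do: to_string(value)

  defp compile_expr(value) when is_binary(value) do
    "'" <> String.replace(value, "'", "''") <> "'"
  end

  defp op_sql(:==), do: "="
  defp op_sql(:!=), do: "<>"
  defp op_sql(:and), do: "AND"
  defp op_sql(:or), do: "OR"
  defp op_sql(op), do: Atom.to_string(op)

  # Double-quote identifiers, doubling any embedded quote characters.
  defp quote_ident(name) do
    "\"" <> String.replace(name, "\"", "\"\"") <> "\""
  end
end
```

For example, `Sketch.SQL.to_sql({:select, ["name", "age"], "users", {:>, {:col, "age"}, 30}})` returns `SELECT "name", "age" FROM "users" WHERE ("age" > 30)`. Dialect differences (identifier quoting, operators, functions) are where most of the remaining work would lie.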

cigrainger commented on August 22, 2024

That makes a lot of sense. It definitely seems like it would bring a lot of unnecessary baggage. I suppose I'm just a bit daunted by building up an AST from Explorer functions and translating them to SQL. If I'm overestimating that task, then absolutely happy to skip Ecto and just rely on the adapters.

kevinkirkup commented on August 22, 2024

HDF5 file format would be useful.

https://en.m.wikipedia.org/wiki/Hierarchical_Data_Format

rupurt commented on August 22, 2024

@josevalim sweet. Do you have a link to a repo? I could probably use that to get started on the ODBC one. I've written an Ecto adapter for Db2 (closed source :{) and I'm pretty familiar with the spec and the current shortcomings of the Erlang ODBC driver.

It sounds like the ADBC one would probably make a separate DuckDB backend obsolete.

josevalim commented on August 22, 2024

We are working on github.com/cocoa-xu/adbc and there is a branch. But it is still early stage and very WIP. I think the ADBC is orthogonal to the DuckDB and Polars ones.

josevalim commented on August 22, 2024

Nice find. Polars for Python supports it too: https://pola-rs.github.io/polars-book/user-guide/io/database/

rupurt commented on August 22, 2024

Awesome. Thanks @josevalim. The ADBC repo looks like a fantastic resource as a baseline for ODBC.

rupurt commented on August 22, 2024

@josevalim @cigrainger I took a bit of a different path. I ended up creating a DuckDB extension so that it can be used in more contexts. That should mean that once the ADBC backend is ready we can connect to ODBC datasources in Elixir through the extension.

https://github.com/rupurt/odbc-scanner-duckdb-extension

rupurt commented on August 22, 2024

Sweet. My next step is to get this working in Elixir so hopefully what we currently have is enough to get this going.

abrunner94 commented on August 22, 2024

I'm looking for the equivalent of to_sql(dataframe) in Pandas. Essentially, my goal is to write dataframes to MySQL, Postgres, etc. Is there any way to do this at this time?

josevalim commented on August 22, 2024

It is not possible currently. :)

abrunner94 commented on August 22, 2024

Ah bummer! Will it ever be part of Explorer or is anything similar planned?
