
Comments (10)

shahamran commented on August 28, 2024

@hantusk this looks very cool and potentially a good workaround. Thanks for sharing!

from pyo3-polars.

ritchie46 commented on August 28, 2024

This is something I want to get into too. But it needs to be more than a trait, as we want to get over FFI. On the Rust side there is already AnymousSource. This will be extended to support the new streaming engine.


ritchie46 commented on August 28, 2024

Wow, you are quick. I am still working on the example. :D


NielsPraet commented on August 28, 2024

For my thesis I am currently looking at how I can hook an existing backend query service into Polars to use the Lazy DataFrame API. However, this would need to be passed from the Rust side to the Python side, as the use case is aimed at data scientists and ML engineers working in Python. From what I gathered, this unfortunately seems to be impossible right now, so I want to +1 this issue, as it would open up a lot of possibilities for the Polars ecosystem!


shahamran commented on August 28, 2024

Can anyone suggest how to work around this limitation? That is, how can I "extend polars" to support scanning my custom file formats?

I looked at https://github.com/universalmind303/polars-mongo, which seems clean and straightforward but suffers from the same limitation as in #67.


hantusk commented on August 28, 2024

You might be able to scan your custom file formats using fsspec. Here's an example: https://csvbase.com/blog/7.


shahamran commented on August 28, 2024

Hi @ritchie46, I've been using the newly released IO plugins and it works well, thank you.

I have a question regarding n_rows. In the docstring it says:

n_rows: Materialize only n rows from the source. The reader can stop when n_rows are read.

Is it before or after the predicate is applied? In this context, what's the meaning of "materialize"?

Thanks again for implementing this!


ritchie46 commented on August 28, 2024

Here is the working example: https://github.com/pola-rs/pyo3-polars/tree/main/example/io_plugin


shahamran commented on August 28, 2024

@ritchie46 thank you.

I understand from this that n_rows can be used regardless of predicate. I have another question. Can I modify n_rows to account for batch sizes? e.g.:

def _read_my_format_impl(path: str, ...) -> pl.DataFrame: ...

def scan_my_format(paths, ...) -> pl.LazyFrame:
    def _read_my_format(with_columns, predicate, n_rows, batch_size):
        for path in paths:
            df = _read_my_format_impl(path, columns=with_columns, n_rows=n_rows)
            if predicate is not None:
                df = df.filter(predicate)
            yield df
            if n_rows is not None:
                n_rows -= df.height  # <-- is this legit?
                if n_rows <= 0:
                    break

    return register_io_source(callable=_read_my_format, schema=...)


ritchie46 commented on August 28, 2024

Maybe. You are not allowed to return more than n_rows. It is the upper limit.
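
The one rule this reply states is that the total number of rows yielded must never exceed n_rows. Below is a minimal sketch of that bookkeeping, using plain Python lists as stand-ins for DataFrames so it runs without Polars. `capped_batches` is a hypothetical helper, not part of the Polars API. The thread does not settle whether the budget is counted before or after the predicate, so filtering is deliberately left out here.

```python
from typing import Iterable, Iterator, List, Optional, TypeVar

T = TypeVar("T")


def capped_batches(
    batches: Iterable[List[T]], n_rows: Optional[int] = None
) -> Iterator[List[T]]:
    """Yield batches while never emitting more than n_rows rows in total.

    Hypothetical helper: each batch is a plain list standing in for a DataFrame.
    """
    remaining = n_rows  # None means "no limit"
    for batch in batches:
        if remaining is not None:
            if remaining <= 0:
                break  # budget exhausted, stop reading further batches
            batch = batch[:remaining]  # truncate so the cap is never exceeded
            remaining -= len(batch)
        yield batch


if __name__ == "__main__":
    for out in capped_batches([[1, 2, 3], [4, 5, 6], [7, 8]], n_rows=4):
        print(out)
```

With real DataFrames the truncation step would presumably be something like `df.head(remaining)` and the count `df.height`, as in the question above; treat that mapping as an assumption rather than a confirmed recipe.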
