
Comments (2)

rjzamora commented on June 30, 2024

Thank you for raising this, @vyasr!

I have spent some time exploring the importance of cudf's NativeFile dependency. In theory, we should be able to achieve the same performance without it. We are not actually using Arrow to transfer any remote data at all unless the user specifically opens their file(s) with the pyarrow filesystem API. Instead, we are just using Arrow as a translation layer between our Python-based fsspec file and something that libcudf recognizes as a proper data source.

If we were to change the python code to stop relying on NativeFile today, we could probably optimize the existing use_python_file_object=False logic to avoid a significant run-time regression. The only necessary regression (besides losing support for pyarrow filesystems) would be an increase in host-memory usage during partial IO. This is because we would need to pass down a byte range to libcudf that "looks" like an entire file (even if we are only reading a single column, and most of the bytes are left "empty").
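To make the host-memory cost concrete, here is a minimal sketch in plain Python. It is not cudf or libcudf code: the helper name `dense_buffer_from_ranges` is hypothetical, and `io.BytesIO` is only a stand-in for a remote fsspec file.

```python
import io


def dense_buffer_from_ranges(remote_file, file_size, ranges):
    """Build a full-size buffer holding only the requested byte ranges.

    Hypothetical helper: every byte outside ``ranges`` stays zero, yet
    the buffer still occupies ``file_size`` bytes of host memory.
    """
    buf = bytearray(file_size)  # allocates the whole file's size up front
    for start, end in ranges:
        remote_file.seek(start)
        buf[start:end] = remote_file.read(end - start)
    return buf


# Stand-in for a remote file: 1 MiB of data, of which only two small
# ranges are actually needed (assumed values, for illustration only).
data = bytes(range(256)) * 4096
remote = io.BytesIO(data)
needed = [(0, 128), (len(data) - 64, len(data))]

buf = dense_buffer_from_ranges(remote, len(data), needed)
useful = sum(end - start for start, end in needed)
print(f"allocated {len(buf)} bytes to hold {useful} useful bytes")
```

The allocation scales with the file size rather than with the bytes actually read, which is the regression described above.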

Near-term Solution: To avoid excessive host-memory usage in the near term, we could probably introduce some kind of "sparse" byte-range data source in libcudf. It is fairly easy to populate a mapping of known byte ranges efficiently with fsspec. If these known byte ranges could be used to populate a structure that libcudf understands as a file-like object, then we can avoid the host-memory issue.
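A rough Python sketch of what such a "sparse" source could look like is below. Everything here is an assumption for illustration: the class name `SparseByteRangeSource`, the `fetch` helper, and the choice to raise on reads outside a known range are all hypothetical, and how libcudf would actually consume such an object is not shown. `io.BytesIO` again stands in for a remote fsspec file; in practice the ranges might be fetched in bulk with something like fsspec's `cat_ranges`.

```python
import bisect
import io


class SparseByteRangeSource:
    """File-like view over a few known byte ranges of a larger file.

    Hypothetical sketch: only the known ranges are held in memory, and
    a read that touches unknown bytes raises instead of returning padding.
    """

    def __init__(self, file_size, known_ranges):
        # known_ranges maps a start offset to the bytes stored at that offset.
        self.size = file_size
        self._starts = sorted(known_ranges)
        self._chunks = [known_ranges[s] for s in self._starts]
        self._pos = 0

    def seek(self, offset, whence=io.SEEK_SET):
        base = {io.SEEK_SET: 0, io.SEEK_CUR: self._pos, io.SEEK_END: self.size}[whence]
        self._pos = base + offset
        return self._pos

    def tell(self):
        return self._pos

    def read(self, nbytes):
        # Find the last known range starting at or before the current position.
        i = bisect.bisect_right(self._starts, self._pos) - 1
        if i < 0:
            raise ValueError(f"offset {self._pos} is not in a known range")
        rel = self._pos - self._starts[i]
        chunk = self._chunks[i]
        if rel + nbytes > len(chunk):
            raise ValueError("read extends past the known range")
        self._pos += nbytes
        return chunk[rel:rel + nbytes]


def fetch(f, start, end):
    """Read bytes [start, end) from a seekable file-like object."""
    f.seek(start)
    return f.read(end - start)


# Stand-in for a remote file: 1 MiB of data with two known ranges.
data = bytes(range(256)) * 4096
remote = io.BytesIO(data)
wanted = [(0, 128), (len(data) - 64, len(data))]
known = {start: fetch(remote, start, end) for start, end in wanted}

source = SparseByteRangeSource(len(data), known)
source.seek(-64, io.SEEK_END)  # seek relative to the full file size
footer = source.read(64)
held = sum(len(chunk) for chunk in known.values())
print(f"file size {source.size}, bytes held in memory {held}")
```

The object reports the full file size for seeking, but only stores the bytes that were fetched, so memory use follows the ranges read rather than the size of the file.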

(Possible) Long-term Solution: We roll our own filesystem API at the C++ level and avoid all Python-related performance concerns :)


vyasr commented on June 30, 2024

CC @GregoryKimball @rjzamora

