Comments (6)

mattf-apache commented on June 17, 2024

Note that the instructions for using a packed conda environment with Spark/PySpark, at https://conda.github.io/conda-pack/spark.html, do not mention the need to run conda-unpack on the nodes.

from conda-pack.

jcrist commented on June 17, 2024

conda-unpack only needs to be run if libraries embed absolute paths that must be resolved before use. This is rare, especially when the environment is used from Python. Also, in a common configuration the directory that YARN localizes (unpacks) the archive to is read-only to the user, so conda-unpack would fail. Many users have distributed conda environments with conda-pack to run workloads on YARN (with both Spark and Dask) without needing to run conda-unpack.

Is there a specific library that requires an absolute path that is causing you problems here? Or is this something we should clarify in our docs?
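To make the read-only point concrete, here is a minimal sketch of a pre-flight check a job could run on a node before attempting conda-unpack. The helper name `can_run_conda_unpack` is hypothetical, and the `bin/conda-unpack` location assumes the Unix layout of an unpacked conda-pack archive. A directory-level writability check is only an approximation, since individual files may still be read-only.

```python
import os


def can_run_conda_unpack(env_dir):
    """Return True if env_dir holds a conda-unpack script and is writable.

    Hypothetical helper for illustration, not part of conda-pack.
    """
    # Assumption: Unix layout, where the script sits at bin/conda-unpack.
    script = os.path.join(env_dir, "bin", "conda-unpack")
    # conda-unpack rewrites files in place, so a read-only directory
    # (as with a YARN-localized archive) means the fix-up cannot run.
    return os.path.isfile(script) and os.access(env_dir, os.W_OK)
```

If this returns False for the localized directory, the environment has to be used as-is, which is the situation described above.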

mattf-apache commented on June 17, 2024

Hi @jcrist, after reading the docs and trying to understand which kinds of packages were likely to need path fix-up, I speculated that most pure-Python packages would run fine without conda-unpack, but that complex packages like tensorflow, which link to various C libraries and access hardware drivers, were likely to need it.

More concretely, I looked at the conda-unpack script for a relatively simple environment containing just the packages needed to run the PySpark unit tests (numpy, pandas, pyarrow, and scipy, with python2.7). It had more than 400 lines of fix-ups, which I assumed were important. A more complex environment that added tensorflow, tensorflow-hub, scikit-learn, psycopg2, pytorch-cpu, and cython, with python3.6, had 900+ lines of fix-ups.

That said, I have not done the extensive testing that would be needed to actually find and prove specific problems resulting from NOT running conda-unpack.

If you say we don't need to run conda-unpack for these common AI libraries, I'll take your word for it. But I would like to understand why these hundreds of fix-ups are okay to ignore, and yes, it would be good to expand the docs on which categories of packages typically have absolute paths and therefore need conda-unpack. Thanks.

jcrist commented on June 17, 2024

Absolute paths are often embedded in binary files only for use in stack traces, and they don't need to be rewritten for the library to work properly. So far I've come across only one program that required running conda-unpack to function properly (clear, installed as part of ncurses), and users would never run it as part of a Dask/Spark job. I can't give you a definitive list of "these libraries are ok, these ones are not" because all libraries are different, but in my experience most libraries (numpy, pandas, scipy, scikit-learn, pyarrow, tensorflow, etc.) work fine as is.
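For anyone who wants to see where an environment embeds its original prefix, the following is a rough sketch that walks an unpacked environment and reports which files contain that prefix, split into text and binary. It is not how conda-pack builds its fix-up list: `find_prefix_references` is a hypothetical helper, and treating any file with a NUL byte as binary is a simplifying assumption.

```python
import os


def find_prefix_references(env_dir, prefix):
    """Map relative file paths to "text" or "binary" for files containing prefix.

    Hypothetical helper for illustration, not part of conda-pack.
    """
    needle = prefix.encode("utf-8")
    hits = {}
    for root, _dirs, files in os.walk(env_dir):
        for name in files:
            path = os.path.join(root, name)
            if os.path.islink(path):
                continue  # skip symlinks so targets are not counted twice
            try:
                with open(path, "rb") as fh:
                    data = fh.read()
            except OSError:
                continue  # unreadable file, ignore it
            if needle in data:
                # Assumption: a NUL byte marks the file as binary.
                kind = "binary" if b"\0" in data else "text"
                hits[os.path.relpath(path, env_dir)] = kind
    return hits
```

A hit only shows that the prefix string is present. As noted above, many such occurrences in binaries are there for stack traces and do not affect whether the library works.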

> yes it would be good to expand the docs about what categories of packages typically have absolute paths and therefore need conda-unpack. Thanks.

Sure. If you have time to submit a PR adding language to the Spark docs, I'd happily merge it. I'm unlikely to get to this in the near future.

mattf-apache commented on June 17, 2024

Truly appreciate the explanation. I will propose a PR for a doc change, adding your info and a section listing packages known to need path fix-up (with an invitation to add to the list as others are encountered). Give me a couple of days, as I'm finishing some other work :-)

github-actions commented on June 17, 2024

Hi there, thank you for your contribution!

This issue has been automatically locked because it has not had recent activity after being closed.

Please open a new issue if needed.

Thanks!
