Introduction to Conda for (Data) Scientists

Home Page: https://carpentries-incubator.github.io/introduction-to-conda-for-data-scientists/

License: Other

This lesson is an introduction to Conda for (data) scientists. Conda is an open source package and environment management system that runs on Windows, macOS and Linux. Conda installs, runs, and updates packages and their dependencies. Conda easily creates, saves, loads, and switches between environments on your local computer. While Conda was created for Python programs, it can package and distribute software for any language, such as R, Ruby, Lua, Scala, Java, JavaScript, C/C++, and FORTRAN. This lesson motivates the use of Conda as a development tool for building and sharing project-specific software environments that facilitate reproducible (data) science workflows.

Contributing

We welcome all contributions to improve the lesson! Maintainers will do their best to help you if you have any questions, concerns, or experience any difficulties along the way.

We'd like to ask you to familiarize yourself with our Contribution Guide and have a look at the more detailed guidelines on proper formatting, ways to render the lesson locally, and even how to write new episodes.

Please see the current list of issues for ideas for contributing to this repository. For making your contribution, we use the GitHub flow, which is nicely explained in the chapter Contributing to a Project in Pro Git by Scott Chacon. Look for the tag . This indicates that the maintainers will welcome a pull request fixing this issue.

Maintainer(s)

Current maintainers of this lesson are

David R. Pugh

Authors

A list of contributors to the lesson can be found in AUTHORS

Citation

To cite this lesson, please consult with CITATION

introduction-to-conda-for-data-scientists's Issues

Episode on Conda + Docker

Add an optional episode on how to integrate Conda + Docker, based on the Medium article I wrote last year.

https://towardsdatascience.com/conda-pip-and-docker-ftw-d64fe638dc45

2 working with environments

I think there is no need to open bash to manage environments as suggested in the section "Workspace for Conda environments".

Based on this explanation https://conda.io/projects/conda/en/latest/user-guide/getting-started.html

I think

conda create --name

would be enough? (and the OS would not matter)

Actually, I realise I am not clear on why a workspace for Conda environments is created at the outset, as proposed in the lesson. I saw afterwards that conda create --name is used to create an environment, but I don't understand the difference.
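
For comparison, the two ways of creating an environment that this issue contrasts could be sketched as below. The function names, the example name demo-env, and the choice of python as the only package are illustrative assumptions, not taken from the lesson.

```shell
# Named environment: stored in conda's default envs directory and
# activated later with `conda activate <name>`.
create_named_env() {
  conda create --name "$1" --yes python
}

# Prefix environment: stored in ./env inside the current directory and
# activated later with `conda activate ./env`.
create_prefix_env() {
  conda create --prefix ./env --yes python
}
```

Calling create_named_env demo-env works the same from any directory, whereas create_prefix_env places the environment relative to wherever it is run, which is presumably why the lesson sets up a project workspace first.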

Typos

Quickly ran codespell over the lesson and it caught a few typos:

alanc@~/introduction-to-conda-for-data-scientists(gh-pages)$ codespell --skip="assets,bin,css,data,deprecated,fig,files,AUTHORS,CONTRIBUTORS.md,*.csv,.mailmap" --quiet-level=2 --ignore-words-list="rouge,keyserver"
./_episodes/04-using-packages-and-channels.md:34: directoy  ==> directory
./_episodes/02-working-with-environments.md:129: depedency  ==> dependency
./_episodes/02-working-with-environments.md:195: undesireable  ==> undesirable
./_episodes/03-sharing-environments.md:219: dependecies  ==> dependencies
./_episodes/03-sharing-environments.md:219: enviroment  ==> environment
./_episodes/01-getting-started-with-conda.md:15: enviroment  ==> environment
./_episodes/01-getting-started-with-conda.md:22: pacakges  ==> packages
./_episodes/01-getting-started-with-conda.md:28: sofware  ==> software
./_episodes/01-getting-started-with-conda.md:56: lanaguage  ==> language
./_episodes/01-getting-started-with-conda.md:132: mangager  ==> manager
./_episodes/01-getting-started-with-conda.md:139: verion  ==> version
./_episodes/01-getting-started-with-conda.md:266: upto  ==> up to

Add examples showing how to combine conda and pip

It is common to use pip to install packages that are not otherwise available via conda. A good example would be the kaggle package for interfacing with the Kaggle API.

Something like the following environment.yml file...

name: null

dependencies:
  ...
  - pip:
    - kaggle

prefix: ./env
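
A complete, minimal version of such a file might look like the sketch below. The python and pip entries stand in for the elided dependencies and are assumptions; only the kaggle entry comes from the issue. The helper name create_env_from_file is hypothetical.

```shell
# Write an environment file that mixes conda and pip dependencies.
# Listing pip itself as a conda dependency means the nested pip section
# is installed with the environment's own pip.
cat > environment.yml <<'EOF'
name: null

dependencies:
  - python
  - pip
  - pip:
    - kaggle
EOF

# Build the environment in ./env from that file.
create_env_from_file() {
  conda env create --prefix ./env --file environment.yml
}
```

Passing --prefix on the command line keeps the target location out of the file itself, so the same file can be reused in other project directories.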

Separate episode for Jupyter and Conda (+pip) interop

Move content on Jupyter + Conda from episode into its own episode and add content from JupyterCon 2020 talk.

use of `./env` may be confusing

In 03-using-packages-and-channels, the command conda install tensorflow=1.14 --channel conda-forge --prefix ./env uses ./env as the folder name. At the workshop today, this confused learners, as they had learned about conda env. Maybe use --prefix ./my_env instead?

Adding clarification for using conda + pip

In 01-getting-started-with-conda there is this passage:

Conda allows for using other package management tools (such as pip) inside Conda environments, where a library or tools is not already packaged for Conda (we’ll show later how to get access to more conda packages via channels).

As using pip sometimes is unavoidable one should be told that issues can arise when conda and pip are used together, especially when used back-to-back multiple times. A short note and a link to resources might be sufficient.

Exporting an environment using --from-history is not showing pip installed packages!

Is it possible to generate an environment file for an existing environment that includes packages installed via pip?
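
One possible workaround is sketched below. It assumes that the full conda env export output lists pip-installed packages in a nested pip: section indented by four spaces; that layout is an assumption about conda's output format, and the helper name pip_packages_in_env is hypothetical.

```shell
# Print the pip-installed packages of the environment named $1, one per
# line, by reading the nested "pip:" section of the full export.
pip_packages_in_env() {
  conda env export --name "$1" |
    awk '/^ *- pip: *$/ { in_pip = 1; next }
         # entries of the pip section are indented by four or more spaces
         in_pip && /^    +- / { sub(/^    +- /, ""); print; next }
         { in_pip = 0 }'
}
```

The printed entries could then be copied by hand into a pip: section of the file produced with --from-history.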

Add section discussing conda and Jupyter notebooks

@jakevdp has an excellent blog post that summarizes the key issues.

Can this post be summarized in a callout box? Or is more discussion required?

Add example of environment file with channels

There is currently no example environment file explicitly including channels in episode 4.
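
A minimal sketch of such a file is shown below. The package names python and numpy are placeholders chosen for illustration; the channel names are the commonly used conda-forge and defaults.

```shell
# Write an environment file that names its channels explicitly.
# Channels are listed from highest to lowest priority.
cat > environment.yml <<'EOF'
name: null

channels:
  - conda-forge
  - defaults

dependencies:
  - python
  - numpy
EOF
```

An environment could then be built from it with conda env create --prefix ./env --file environment.yml.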

advanced exercises

During our workshop on the topic we got requests for more advanced exercises than 'just repeating what was done in the lesson'.
These exercises could be provided in separate tabs of the exercise box, or, as suggested by toby, in separate dropdown boxes, or in some other form. (I do not know enough about the lesson template and tooling to know what could be a good option; open to discussion in comments.)
A suggestion would also be to keep the basic exercise set very limited in terms of download and installation times.
So it could be along the lines of
basic: create a conda environment for exploratory data analysis with numpy and matplotlib (or others, it should still make some sense)
advanced: create a conda environment for machine learning with scikit learn etc and some additional questions such as what would you need to do to add package x with dependency y version z (other than what was installed before).

Given examples above are not so useful here, as the additional question is very dependent on where in the material the exercise is located. But I added them to explain how it could look.

Demonstrate effect of conda activate

In 02-working-with-environments, around the first time conda activate is run, it would be beneficial to demonstrate the effect of the command by something like:

run $ python (to get into the interactive Python interpreter)
attempt to import a package so that a ModuleNotFoundError appears
then run conda activate basic-scipy-env
now redo the import, showing that the package is now available

This requires the package not to be installed in the base installation. Alternatively, different versions of Python can be used (the Python editor first shows which version is used).
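
A non-interactive variant of the same demonstration could be sketched as a small helper. The function name show_python and the use of scipy as the test package are assumptions; basic-scipy-env is the environment named above.

```shell
# Print the interpreter that is first on PATH and report whether the
# given module can be imported with it.
show_python() {
  command -v python
  if python -c "import $1" 2>/dev/null; then
    echo "$1: importable"
  else
    echo "$1: not importable"
  fi
}
```

Running show_python scipy, then conda activate basic-scipy-env, then show_python scipy again should show both the interpreter path and the import result changing, provided the package is absent from the base installation.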

Update PyTorch version in PyTorch Geometric environment.yml example

Environment no longer builds properly. Bumping PyTorch version to 1.5 solved the problem.

Section on conda install command in episode 2

Episode 2 needs a short section on using conda install to install a package into an activated environment. Section should include a callout box showing how pip can be used to install a package into an environment.

Example of creating a Python kernel from a Conda environment

Add a section or callout discussing how to create your own Python kernel from a Conda environment and make it accessible from within JupyterLab. This is super useful!

https://stackoverflow.com/questions/53004311/how-to-add-conda-environment-to-jupyter-lab
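
A rough sketch of the steps involved is shown below, assuming the target environment is already activated. The function name register_kernel and the display-name format are illustrative choices.

```shell
# Register the currently activated Conda environment as a Jupyter kernel.
# $1 is the name under which the kernel should appear.
register_kernel() {
  # ipykernel has to be installed inside the environment itself
  conda install --yes ipykernel
  # Register the environment's interpreter as a per-user kernel
  python -m ipykernel install --user --name "$1" --display-name "Python ($1)"
}
```

After conda activate basic-scipy-env, calling register_kernel basic-scipy-env should make a kernel labelled "Python (basic-scipy-env)" selectable in JupyterLab.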

Add discussion of Mamba and the Snakepit

Recent developments of Mamba deserve a mention as it should improve the UX for some users. The SnakePit which is a parallel effort to build a community around Mamba/Conda might also be worth a mention.

Make these lessons teachable via Binder

Want to make it easy for learners to get started learning Conda in a cloud environment without having to install it themselves. Why?

Conda is already available in many images widely used in public clouds so in many cases users will not need to know how to install it themselves.
Makes this course easier to teach remotely where it is difficult to debug installation issues should they occur.

To do this I need to do the following.

Add Binder config files to project root
Add Binder links to GitHub README and to lesson setup instructions
Add section on creating a new Python kernel from an existing Conda environment to the lessons.

FIXME link broken on contributing.md

Hi,

the link FIXME available on CONTRIBUTING.md seems to be broken

Cheers
FP

Suggestion for rewording the conda vs pip section

Currently the conda vs pip section focuses on the environment management side (which to me isn't a very strong argument), rather than the other advantages of conda (e.g. the use of MKL, cross platform support, and the prebuilt binaries). It also focuses on the conda vs pip issue, rather than why use conda (as opposed to other systems, e.g. docker, spack).

Would you accept a PR which rewords it, giving the following points (under the title "Why use Conda"):

Anaconda provides commonly used data science libraries and tools built against Intel's numerical libraries (mention tensorflow here, but I'd add numpy and scipy, and I presume R?)
Conda is cross-platform and supports duplicating your install on other people's systems (add a link about reproducibility?)
Pip (and I guess the equivalent R tools?) can be used to install additional things that have not been packaged yet

And then a short section on why not to use conda (prebuilt binaries are slower, not integrated with the rest of the system, so speak to admins of clusters/supercomputers in case they have a better suggestion)?

summarizing exercise

In preparation for the workshop we were thinking that it would be nice to offer some summarizing exercise in the end of the workshop, where everything or most of what was learned can be applied.

@bast created one such exercise here: https://github.com/coderefinery/conda-exercise

Another idea would be to pair up participants (preferably with different OSes) in teams of 2, and let them go through a workflow similar to how it could go in a project, with different steps:
First, both create an environment (packages could be suggested by the material), write or export an environment.yml without version numbers, and share it with their partner; the partner creates an environment from the file and the results are compared (to notice the differences coming from low-level OS-dependent packages, possibly version differences, ...). Then the same with version numbers.
Could make use of some diff tool to explore differences of environments.

This could then also be taken even further as discussed with @davidrpugh , @annefou, @naoe-tatara and toby today to go towards actually finding a package with some deprecated function call and let the learners build an environment for that case with provided code to test if it works and to see how the error messages may look.

Happy to discuss different possibilities of a summarizing exercise here :)

conda help is not a valid command

In the setup file, it says to check your conda install using "conda help", which on my 2017 MacBook Pro running Ventura produces a command not found error. While it's possible this is somehow just me, from my experience with convention, it is not just me, and the lesson needs to be changed to say "conda --help". I'll make the PR if desired.

Add an example of cloning a conda environment

Commonly used packages go into a "my-common-packages" environment, and then you can clone this environment for all new projects.
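
A minimal sketch of such a cloning step, using conda create with its --clone option; the function name clone_env and the target name used in the usage note are hypothetical.

```shell
# Create a new environment ($2) as a copy of an existing one ($1).
clone_env() {
  conda create --name "$2" --clone "$1"
}
```

For example, clone_env my-common-packages new-project-env would give a new project its own copy of the shared environment, which can then be modified without affecting the original.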

Suggestion for "02 Working with Environments"

It did not become clear to me from the material why creating environments in project subfolders is beneficial.
Compared to the drawbacks, it seems like one should have very good reasons to do so.

"Placing Conda environments outside of the default ~/miniconda3/envs/ folder comes with a couple of minor drawbacks. First, conda can no longer find your environment with the --name flag; you’ll generally need to pass the --prefix flag along with the environment’s full path to find the environment."

Also, commands like conda env list do not work when environments are placed outside the default location (as another example).

AFAIK it's really a matter of taste, and having different names + exporting environments also has advantages.

I'd suggest to really point out the advantages or maybe map both methods to examples and explain which method is better suited for which example.
Also, in the further course switches between different location variants are distracting from the actual topic covered and might be rather irritating to participants. I'd suggest here to focus on one single variant (either conda create --name python36-env or conda create --prefix ./env ) and strictly stick to it, adding the alternative execution method as additional material instead.

Add examples of how to add channel priorities to environment files.

Need explicit discussion of how to add channel priorities to environment files. Typical way is to add a channels section to your environment.yml file.

name: null

channels:
  - conda-forge
  - defaults

...

Priority is from highest to lowest, so in the above the conda-forge channel would be given highest priority.

Add examples for R environments

Need to add a few more examples showing users how to create environments for R.
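
As a starting point, one such example could be sketched as follows. It assumes the r-base and r-tidyverse packages from the conda-forge channel; the function name create_r_env and the package selection are illustrative.

```shell
# Create a named environment with R and the tidyverse from conda-forge.
create_r_env() {
  conda create --name "$1" --channel conda-forge --yes r-base r-tidyverse
}
```

Calling create_r_env r-analysis-env (a hypothetical name) and then conda activate r-analysis-env should make the environment's own R interpreter the first one on PATH.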

Fix for example "Creating a new environment as a sub-directory within a project directory" needed?

Testing the example "Creating a new environment as a sub-directory within a project directory":

project-dir $ conda create --prefix ./env \
python=3.6 \
matplotlib=3.1 \
tensorflow=2.1 \
pip=20.0

does not work with the specified tensorflow version on Mac OS X 10.15.7:

PackagesNotFoundError: The following packages are not available from current channels:

When running without specification, tensorflow 1.14.0 gets installed instead.

conda env export --from-history

In 03-sharing-environments there is a box warning readers to beware of the conda env export command.

How about suggesting to use 'conda env export --from-history'?
This then includes only packages specifically installed, so some underlying packages may have different versions on different systems/at different points in time. But the resulting yml can be used across different platforms (to my knowledge).

What do you think?
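
A sketch of how this suggestion could look in practice is given below. The helper name export_portable_env is hypothetical, and dropping the prefix: line is an extra step added here on the assumption that the export contains one, since that line records a machine-specific path.

```shell
# Export only the explicitly requested packages of environment $1 to the
# file $2, leaving out the machine-specific "prefix:" line.
export_portable_env() {
  conda env export --name "$1" --from-history | grep -v '^prefix: ' > "$2"
}
```

For example, export_portable_env basic-scipy-env environment.yml would write a file that lists only what was asked for at install time, leaving conda to resolve the remaining dependencies on each platform.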

Installation instructions

Suggestion to restructure installation instructions:

binder as a 'last way out', so in the bottom of the page?
installation walkthrough for all OS (link downloads something, what to click in installer, etc)
workspace setup not specifically reused in lesson? -> unnecessary at this point?
mention anaconda navigator - anaconda prompt (why both there for some OS, what to use...?)

Happy to discuss and help to restructure (Ubuntu 16/20 and MacOS mojave available) :)

conda cheatsheet

Originally discussed with @bast (please add or clarify if I forgot something)

It would be great to have some kind of cheatsheet for the workshop with short explanations, including:

all commands used,
possible useful options for some commands
environment.yml sections
other useful commands related to conda, such as which python (unix)/ where python(windows)
...

separate 'use of conda + pip'

Hei,

This is a suggestion to gather all pip-related information and exercises at the end of lesson 3, except for the pip section in environment files.
This could be a whole section of its own, explaining what pip is, why pip is useful even when using conda, how the two can be used together, and what may happen if pip is not explicitly installed into an environment.

Provide an example of how to create env by prefix but with a name

Want to show users how to create environment by prefix (i.e., in a particular location) but with a name so that it can be activated by name. This appears to be possible.

https://stackoverflow.com/questions/49638329/how-to-create-conda-env-with-both-name-and-path-specified
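
One approach relies on conda's envs_dirs setting, which lists the directories searched when an environment is looked up by name. The sketch below assumes that approach; the function name create_named_env_at and its argument order are hypothetical.

```shell
# Create an environment in a chosen parent directory ($1) so that it can
# still be activated by name ($2). Remaining arguments are package specs.
create_named_env_at() {
  parent_dir=$1
  env_name=$2
  shift 2
  # Tell conda to also search parent_dir when resolving environment names
  conda config --append envs_dirs "$parent_dir"
  # The directory name of the new environment doubles as its name
  conda create --prefix "$parent_dir/$env_name" --yes "$@"
}
```

After create_named_env_at /path/to/envs demo-env python, running conda activate demo-env should find the environment by name. The config step changes the user's .condarc, so it only needs to happen once per parent directory.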

Episode on Conda build

There should probably be an episode on Conda build at some point. List of useful references to get started...

https://docs.conda.io/projects/conda-build/en/latest/index.html

Binder not working?

Hei,

we just tested the binder link in the setup instructions with a few people, and unfortunately it did not work for anyone.
In some cases a popup suggested: "Build Recommended JupyterLab build is suggested: jupyter-offlinenotebook needs to be included in build". After clicking Build, it failed to build, nothing more happened, and there was nothing more to do.
In other cases there was no popup suggesting anything, but we were not able to click either files on the left nor launch anything.
I unfortunately do not know what could be wrong here, so it would be great if someone could help to fix :)

(We will probably be test-teaching this lesson in the beginning of january in Finland-Norway cooperation online)

Greetings,
Samantha

Use of set env_prompt needs more explanations

In Episode 02, it is shown how to modify the env_prompt setting in the .condarc file, following the command conda config --set env_prompt '({name})'. It would be useful to add some explanations about:

the argument "name" inside the command shouldn't be replaced by the environment name.
if users open the terminal after running this command, they would see the "anaconda3" instead of “base” on the left side of prompt because (for example in my case on linux) the “base” is located in this path ~/anaconda3.
one way to fix the problem above is to remove the ~/.condarc file.
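
The points above could be illustrated with a pair of small helpers. The function names are hypothetical; the first command is the one quoted from Episode 02, and the second uses conda config --remove-key as a narrower alternative to deleting the whole file.

```shell
# Show only the environment name (or the folder name of a prefix
# environment) in the prompt. {name} is a literal placeholder that conda
# fills in; it must not be replaced by hand.
set_short_env_prompt() {
  conda config --set env_prompt '({name})'
}

# Undo the change without deleting the whole ~/.condarc file.
reset_env_prompt() {
  conda config --remove-key env_prompt
}
```

Removing only the env_prompt key keeps any other settings in ~/.condarc, such as channel configuration, intact.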
