
genv's Introduction


Genv - GPU Environment and Cluster Management

Join the community on our Discord server: https://discord.gg/zN3Q9pQAuT

Genv is an open-source environment and cluster management system for GPUs.

Genv lets you easily control, configure, monitor and enforce the GPU resources that you are using in a GPU machine or cluster.

It is intended to ease the process of GPU allocation for data scientists, without code changes 💪🏻

Check out the Genv documentation site for more details and the website for a higher-level overview of all features.

This project was heavily inspired by pyenv and other version, package, and environment management tools such as Conda, nvm, and rbenv.

Example

❓ Why Genv?

  • Easily share GPUs with your teammates
  • Find available GPUs for you to use: on-prem or in the cloud via remote access
  • Switch between GPUs without code changes
  • Save time while collaborating
  • Serve and manage local LLMs within your team’s cluster

Plus, it's 100% free and gets installed before you can say Jack Robinson.

🙋 Who uses Genv?

Data Scientists & ML Engineers, who:

  • Share GPUs within a research team
    • Pool GPUs from multiple machines (see here) and allocate available GPUs without SSH-ing into every one of them
    • Enforce GPU quotas for each team member, ensuring equitable resource allocation (see here)
    • Reserve GPUs by creating a Genv environment for as long as you use them with no one else hijacking them (see here)
  • Share GPUs between different projects
    • Allocate GPUs across different projects by creating distinct Genv environments, each with specific memory requirements
    • Save environment configurations to seamlessly resume work and reproduce experiment settings at a later time (see here)
  • Serve local open-source LLMs for faster experimentation across the whole team
    • Efficiently run open-source models within the cluster

Admins, who:

  • Monitor their team’s GPU usage with a Grafana dashboard (see the image below)
  • Enforce GPU quotas (number of GPUs and amount of memory) per researcher to keep resource usage fair within the team (see here)

genv grafana dashboard

Ollama 🤝 Genv

Ready to create an LLM playground for yourself and your teammates?

Genv integrates with Ollama for managing Large Language Models (LLMs). This allows users to efficiently run, manage, and utilize LLMs on GPUs within their clusters.

$ genv remote -H gpu-server-1,gpu-server-2 llm serve llama2 --gpus 1

Check out our documentation for more information.

🏃 Quick Start

Make sure that you are running on a GPU machine:

$ nvidia-smi
Tue Apr  4 11:17:31 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.161.03   Driver Version: 470.161.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
...
  1. Install Genv
  • Using pip
pip install genv
  • Using conda
conda install -c conda-forge genv
  2. Verify the installation by running the command:
$ genv status
Environment is not active
  3. Activate an environment (in this example, we activate an environment named my-env with 1 GPU and a 4GB memory capacity)
$ genv activate --name my-env --gpus 1
(genv:my-env)$ genv config gpu-memory 4g
(genv:my-env)$ genv status
Environment is active (22716)
Attached to GPUs at indices 0

Configuration
   Name: my-env
   Device count: 1
   GPU memory capacity: 4g
  4. Start working on your project!

📜 Documentation

Check out the Genv documentation site for more details.

💫 Simple Integration & Usage with your favorite IDE

Integration with VSCode (Take me to the installation guide!)
genv vscode

Integration with JupyterLab (Take me to the installation guide!)
genv jupyterlab

A PyCharm integration is also on our roadmap, so stay tuned!

🏃🏻 Join us in the AI Infrastructure Club

We love connecting with our community, discussing best practices, discovering new tools, and exchanging ideas with makers about anything related to building and AI infrastructure. So we created a space for all these conversations. Join our Discord server for:

  • Genv installation and setup support, as well as best-practice tips and tricks for your use case
  • Discussing possible features for Genv (we prioritize your requests)
  • Chatting with other makers about their projects & picking their brains when you need help
  • Monthly Beers with Engineers sessions with amazing guests from the research and industry (Link to previous session recordings)

License

The Genv software is Copyright 2022 [Run.ai Labs, Ltd.]. The software is licensed by Run.ai under the AGPLv3 license. Please note that Run.ai’s intention in licensing the software is that the obligations of the licensee pursuant to the AGPLv3 license should be interpreted broadly. For example, Run.ai’s intention is that the terms “work based on the Program” in Section 0 of the AGPLv3 license, and “Corresponding Source” in Section 1 of the AGPLv3 license, should be interpreted as broadly as possible to the extent permitted under applicable law.

genv's People

Contributors

davidlif, ekinkarabulut, gitter-badger, razrotenberg


genv's Issues

LLM attach fails in multi user scenarios because of Linux /proc permissions

From the documentation it should be possible to serve an LLM as one user "user1" and then have other users, e.g. "user2", attach to it via genv llm attach modelname.

However, in practice this fails on Linux hosts because genv, when run as "user2", cannot determine the ollama port from the ollama process ID if "user1" hosts the model. This is because /proc/<pid>/fd is only readable by the user who owns the process, in this case "user1".

A workaround is to punch holes in Linux's process isolation, but that is far from ideal. Ideally, genv would track the ollama port alongside the process ID and make it available to other users, or solve this differently altogether.
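A minimal sketch of the restriction (Linux-only, and purely illustrative): a process can list its own /proc/<pid>/fd directory, but the same directory of a process owned by another user is unreadable without root privileges.

```python
import os

pid = os.getpid()
fd_dir = f"/proc/{pid}/fd"

# On Linux, /proc/<pid>/fd holds symlinks to a process's open file
# descriptors. The directory is readable only by the owning user (or root),
# which is why genv running as "user2" cannot inspect an ollama process
# owned by "user1" to discover its listening port.
if os.path.isdir(fd_dir):  # skip gracefully on non-Linux systems
    fds = sorted(int(name) for name in os.listdir(fd_dir))
    print(fds[:3])  # typically starts with [0, 1, 2]: stdin, stdout, stderr
```

Attempting the same `os.listdir` on a PID owned by a different (non-root) user raises PermissionError, which matches the behavior described above.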

genv enforce is not terminating the process when using ray

I am running a script with genv and allocating 1 GPU. I started the script and then ran the enforcement command with 0 devices as the enforcement rule. Genv detects that I am using more than I am allowed:

User ekinkarabulut is using 1 devices which is 1 more than the maximum allowed
Detaching environment 43155 of user ekinkarabulut from device 0

It detaches the genv environment from the device. I can't see any device attached when I run genv devices:

ID      ENV ID      ENV NAME        ATTACHED
0
1

However, it doesn’t terminate the process so my job is still running (I can see it running when I check nvidia-smi):


Wed Aug  2 09:47:34 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.161.03   Driver Version: 470.161.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   73C    P0    75W / 149W |    505MiB / 11441MiB |     43%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           Off  | 00000000:00:05.0 Off |                    0 |
| N/A   38C    P8    28W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     43155      C   ray::_wrapper                     502MiB |
+-----------------------------------------------------------------------------+

Enforcement with sudo using sudo -E env PATH="$PATH" genv enforce --interval 3 --max-devices-per-user 0 is giving the same result.

P.S.: To make sure this is not a general issue, I also ran another script (without Ray) within a genv environment and enforced the same rule; with normal scripts, the process is terminated smoothly. It seems to be an issue specific to the Ray integration.

zsh error

Hi,

Great project! I started to play with it, and it seems to work well in Bash, but it fails immediately in Zsh.

> genv activate                     
_genv_backup_env:2: bad substitution

You probably used a bash-specific idiom that doesn't work in Zsh.

As a piece of generic advice, running shellcheck might give you improvement ideas:
https://github.com/koalaman/shellcheck

genv-help: command not found

Installed genv in a jupyterhub/singleuser:3.1.0 Docker container with conda install -y -c conda-forge genv.

Running genv, genv config, or genv activate returns:

genv-help: command not found

Incompatible with Anaconda

Genv does not seem to be fully compatible with conda. If you run genv activate and then activate your conda environment, your genv reservation dies. Similarly, if you first activate your conda environment and then install genv, you cannot use genv activate at all, as it rejects the command. It's kind of messy.

New GPU Addition Not Showing

I just added a fourth GPU to my desktop. genv devices does not show the new index, and I can't attach to it with --index 3.

Tried pip uninstall and pip install (latest release), but that didn't help. Any suggestions would be very appreciated.

(base) anindya@SGPUW2:~$ nvidia-smi
Tue Mar 19 15:33:50 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14              Driver Version: 550.54.14      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4060 Ti     Off |   00000000:04:00.0 Off |                  N/A |
|  0%   34C    P8              2W /  165W |      19MiB /  16380MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 4080        Off |   00000000:17:00.0 Off |                  N/A |
|  0%   42C    P8             19W /  320W |       9MiB /  16376MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA GeForce RTX 4060 Ti     Off |   00000000:65:00.0 Off |                  N/A |
|  0%   35C    P8              3W /  165W |       8MiB /  16380MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  Quadro RTX 8000                Off |   00000000:B3:00.0 Off |                  Off |
| 33%   31C    P8              6W /  260W |       5MiB /  49152MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      2737      G   /usr/lib/xorg/Xorg                             14MiB |
|    1   N/A  N/A      2737      G   /usr/lib/xorg/Xorg                              4MiB |
|    2   N/A  N/A      2737      G   /usr/lib/xorg/Xorg                              4MiB |
|    3   N/A  N/A      2737      G   /usr/lib/xorg/Xorg                              4MiB |
+-----------------------------------------------------------------------------------------+
(base) anindya@SGPUW2:~$ genv devices
ID      ENV ID      ENV NAME        ATTACHED
0
1
2
(base) anindya@SGPUW2:~$

json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Upon executing genv activate I get:

Traceback (most recent call last):
  File "/mnt/mass_disk1/e.abc/miniconda3/bin/genv", line 10, in <module>
    sys.exit(main())
  File "/mnt/mass_disk1/e.abc/miniconda3/lib/python3.9/site-packages/genv/cli/__main__.py", line 116, in main
    activate.run(args.shell, args)
  File "/mnt/mass_disk1/e.abc/miniconda3/lib/python3.9/site-packages/genv/cli/activate.py", line 67, in run
    genv.core.envs.activate(
  File "/mnt/mass_disk1/e.abc/miniconda3/lib/python3.9/site-packages/genv/core/envs.py", line 88, in activate
    with State() as envs:
  File "/mnt/mass_disk1/e.abc/miniconda3/lib/python3.9/site-packages/genv/core/utils.py", line 56, in __enter__
    return self.load()
  File "/mnt/mass_disk1/e.abc/miniconda3/lib/python3.9/site-packages/genv/core/utils.py", line 35, in load
    self._state = genv.utils.load_state(
  File "/mnt/mass_disk1/e.abc/miniconda3/lib/python3.9/site-packages/genv/utils/utils.py", line 45, in load_state
    o = json.load(f, cls=json_decoder)
  File "/mnt/mass_disk1/e.abc/miniconda3/lib/python3.9/json/__init__.py", line 293, in load
    return loads(fp.read(),
  File "/mnt/mass_disk1/e.abc/miniconda3/lib/python3.9/json/__init__.py", line 359, in loads
    return cls(**kw).decode(s)
  File "/mnt/mass_disk1/e.abc/miniconda3/lib/python3.9/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/mnt/mass_disk1/e.abc/miniconda3/lib/python3.9/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

No idea where it stems from, but it looks like a genv and/or environment issue, hence this report. I have genv installed via conda; its version is 1.2.0.
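For what it's worth, this exact message is what Python's json module produces when handed empty input, so a likely culprit is an empty or truncated genv state file (a guess from the traceback, not a confirmed diagnosis):

```python
import json

# json.loads("") fails with the very message from the traceback above,
# which suggests genv's load_state() read a state file with no content.
try:
    json.loads("")
except json.JSONDecodeError as e:
    print(e)  # Expecting value: line 1 column 1 (char 0)
```

If that is the cause, deleting or re-initializing the corrupt state file would likely let genv regenerate it.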

Configure CUDA and other library versions

Hey, first of all, really cool idea for a tool.

One of the most annoying aspects of sharing a workstation with others is managing different needs for library versions (cuDNN, CUDA Toolkit, etc.). While PyTorch handles this by bundling the required libraries into its Python packages, TensorFlow still requires users to install the necessary library versions themselves.

Would you consider supporting the management of different library/driver versions per environment through this tool?

Multi-machine support

As part of an AI-focused institute, we find genv a great tool for handling single machines. However, we have a fleet composed of many machines. Are there any plans on your side to support or develop features toward handling GPU availability in a cluster, or simply across multiple machines?

Feature request: support for per-user enforcement

Hello, I'm in an academic AI lab that uses genv to manage access to our compute resources. The tool has been great so far, but one thing that would be really convenient for us is the ability to set different quotas/constraints for different users, e.g. a lower max_devices for certain students (we want PhD students to have access to more GPUs than undergrads, for example). A simple config file for every user that defines their quotas would suffice.

This kind of feature would be greatly appreciated. Thanks!
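To make the request concrete, here is a sketch of what such a per-user quota file could look like. This is not an existing genv feature; the file name, keys, and layout are invented purely for illustration:

```yaml
# quotas.yml (hypothetical; genv does not currently read such a file)
users:
  phd-student:
    max-devices: 4      # may attach up to 4 GPUs at once
    gpu-memory: 16g     # per-environment memory cap
  undergrad:
    max-devices: 1
    gpu-memory: 8g
```

An enforcement daemon could then consult a file like this instead of a single global --max-devices-per-user value.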

genv.ray.remote does not work on classes

@genv.ray.remote does not work on classes but only on methods.

When it is used before the functions, it works:

@genv.ray.remote(num_gpus=1)
def train():
    ...

But when it is used with classes;

@genv.ray.remote(num_gpus=1)
class Trainer:
    def train(self):
        ...

it throws the following error:

2023-07-21 08:44:38,183	INFO worker.py:1636 -- Started a local Ray instance.
Traceback (most recent call last):
  File "/home/ekinkarabulut/genv/ray_genv.py", line 105, in <module>
    main_task = trainer.train.remote()
                ^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'ray._raylet.ObjectRef' object has no attribute 'train'
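The error is consistent with the wrapper treating every target like a plain function: wrapping a class in a function-style remote makes Trainer.remote() submit a task whose result is an ObjectRef, and an ObjectRef has no train attribute. A class-aware decorator would need to branch on inspect.isclass so classes become Ray actors. The sketch below is a toy stand-in to illustrate that branch, not genv's actual code:

```python
import inspect

def remote(num_gpus=1):
    """Toy stand-in for a genv.ray.remote-style decorator (hypothetical)."""
    def wrap(target):
        if inspect.isclass(target):
            # A real fix would forward classes to ray.remote so they become
            # actors; treating them like plain functions yields an ObjectRef,
            # matching the AttributeError reported above.
            raise TypeError("classes need actor-style handling")
        def task(*args, **kwargs):
            # A real implementation would attach GPUs here before running.
            return target(*args, **kwargs)
        return task
    return wrap

@remote(num_gpus=1)
def train():
    return "trained"

print(train())  # trained
```

Until the integration supports classes, a workaround is to keep the decorated unit a function and instantiate the class inside it.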
