
genv's Introduction


Genv - GPU Environment and Cluster Management

Join the community on our Discord server: https://discord.gg/zN3Q9pQAuT

Genv is an open-source environment and cluster management system for GPUs.

Genv lets you easily control, configure, monitor and enforce the GPU resources that you are using in a GPU machine or cluster.

It is intended to ease the process of GPU allocation for data scientists, without code changes 💪🏻

Check out the Genv documentation site for more details and the website for a higher-level overview of all features.

This project was heavily inspired by pyenv and other version, package, and environment management tools such as Conda, nvm, and rbenv.

Example

❓ Why Genv?

  • Easily share GPUs with your teammates
  • Find available GPUs for you to use: on-prem or in the cloud via remote access
  • Switch between GPUs without code changes
  • Save time while collaborating
  • Serve and manage local LLMs within your team’s cluster

Plus, it's 100% free and gets installed before you can say Jack Robinson.

🙋 Who uses Genv?

Data Scientists & ML Engineers, who:

  • Share GPUs within a research team
    • Pool GPUs from multiple machines (see here) and allocate available GPUs without SSH-ing into every one of them
    • Enforce GPU quotas for each team member, ensuring equitable resource allocation (see here)
    • Reserve GPUs by creating a Genv environment for as long as you use them with no one else hijacking them (see here)
  • Share GPUs between different projects
    • Allocate GPUs across different projects by creating distinct Genv environments, each with specific memory requirements
    • Save environment configurations to seamlessly resume work and reproduce experiment settings at a later time (see here)
  • Serve local open-source LLMs for faster experimentation across the whole team
    • Efficiently run open-source models within the cluster

Admins, who:

  • Monitor their team’s GPU usage with a Grafana dashboard (see the image below)
  • Enforce GPU quotas (number of GPUs and amount of memory) per researcher to keep resource usage fair within the team (see here)

genv grafana dashboard

Ollama 🤝 Genv

Ready to create an LLM playground for yourself and your teammates?

Genv integrates with Ollama for managing Large Language Models (LLMs). This allows users to efficiently run, manage, and utilize LLMs on GPUs within their clusters.

$ genv remote -H gpu-server-1,gpu-server-2 llm serve llama2 --gpus 1

Check out our documentation for more information.

🏃 Quick Start

Make sure that you are running on a GPU machine:

$ nvidia-smi
Tue Apr  4 11:17:31 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.161.03   Driver Version: 470.161.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
...
  1. Install Genv
  • Using pip
pip install genv
  • Using conda
conda install -c conda-forge genv
  2. Verify the installation by running the command:
$ genv status
Environment is not active
  3. Activate an environment (in this example, we activate an environment named my-env with 1 GPU and a 4GB memory capacity)
$ genv activate --name my-env --gpus 1
(genv:my-env)$ genv config gpu-memory 4g
(genv:my-env)$ genv status
Environment is active (22716)
Attached to GPUs at indices 0

Configuration
   Name: my-env
   Device count: 1
   GPU memory capacity: 4g
  4. Start working on your project!

📜 Documentation

Check out the Genv documentation site for more details.

💫 Simple Integration & Usage with your favorite IDE

Integration with VSCode (Take me to the installation guide!)
genv vscode

Integration with JupyterLab (Take me to the installation guide!)
genv jupyterlab

A PyCharm integration is also on our roadmap, so stay tuned!

🏃🏻 Join us in the AI Infrastructure Club

We love connecting with our community, discussing best practices, discovering new tools, and exchanging ideas with makers about anything related to building and AI infrastructure. So we created a space for all these conversations. Join our Discord server for:

  • Genv installation and setup support, as well as best-practice tips and tricks for your use case
  • Discussing possible features for Genv (we prioritize your requests)
  • Chatting with other makers about their projects & picking their brains when you need help
  • Monthly Beers with Engineers sessions with amazing guests from the research and industry (Link to previous session recordings)

License

The Genv software is Copyright 2022 [Run.ai Labs, Ltd.]. The software is licensed by Run.ai under the AGPLv3 license. Please note that Run.ai’s intention in licensing the software is that the obligations of the licensee pursuant to the AGPLv3 license should be interpreted broadly. For example, Run.ai’s intention is that the terms “work based on the Program” in Section 0 of the AGPLv3 license, and “Corresponding Source” in Section 1 of the AGPLv3 license, should be interpreted as broadly as possible to the extent permitted under applicable law.

genv's People

Contributors

davidlif, ekinkarabulut, gitter-badger, razrotenberg


genv's Issues

LLM attach fails in multi user scenarios because of Linux /proc permissions

From the documentation it should be possible to serve an LLM as one user "user1" and then have other users, e.g. "user2", attach to it via genv llm attach modelname.

However, in practice this fails on Linux hosts because genv, when run as "user2", cannot determine the ollama port from the ollama process ID if "user1" hosts the model. This is because /proc/<pid>/fd is only readable by the user who owns the process, in this case "user1".

A workaround is to punch holes in Linux's process isolation, but that is far from ideal. Ideally, genv would track the ollama port alongside the process ID and make it available to other users, or solve this differently altogether.
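A minimal sketch of the restriction (Linux-only, and purely illustrative): a process can list its own /proc/<pid>/fd directory, but the same directory of a process owned by another user is unreadable without root privileges.

```python
import os

pid = os.getpid()
fd_dir = f"/proc/{pid}/fd"

# On Linux, /proc/<pid>/fd holds symlinks to a process's open file
# descriptors. The directory is readable only by the owning user (or root),
# which is why genv running as "user2" cannot inspect an ollama process
# owned by "user1" to discover its listening port.
if os.path.isdir(fd_dir):  # skip gracefully on non-Linux systems
    fds = sorted(int(name) for name in os.listdir(fd_dir))
    print(fds[:3])  # typically starts with [0, 1, 2]: stdin, stdout, stderr
```

Attempting the same `os.listdir` on a PID owned by a different (non-root) user raises PermissionError, which matches the behavior described above.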

genv enforce is not terminating the process when using ray

I am running a script with genv and allocating 1 GPU. I started the script and then ran the enforcement command with 0 devices as the enforcement rule. Genv detects that I am using more than I am allowed:

User ekinkarabulut is using 1 devices which is 1 more than the maximum allowed
Detaching environment 43155 of user ekinkarabulut from device 0

It detaches the genv environment from the device. I can't see any device attached when I run genv devices:

ID      ENV ID      ENV NAME        ATTACHED
0
1

However, it doesn’t terminate the process so my job is still running (I can see it running when I check nvidia-smi):


Wed Aug  2 09:47:34 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.161.03   Driver Version: 470.161.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   73C    P0    75W / 149W |    505MiB / 11441MiB |     43%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           Off  | 00000000:00:05.0 Off |                    0 |
| N/A   38C    P8    28W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     43155      C   ray::_wrapper                     502MiB |
+-----------------------------------------------------------------------------+

Enforcement with sudo using sudo -E env PATH="$PATH" genv enforce --interval 3 --max-devices-per-user 0 is giving the same result.

P.S.: To make sure this is not a general issue, I also ran another script (without Ray) within a genv environment and enforced the same rule; with normal scripts, the process is terminated smoothly. It seems to be an issue specific to the Ray integration.

zsh error

Hi,

Great project! I started to play with it, and it seems to work well in Bash, but it fails immediately in Zsh.

> genv activate                     
_genv_backup_env:2: bad substitution

You probably used a bash-specific idiom that doesn't work in Zsh.

As a piece of generic advice, running shellcheck might give you improvement ideas:
https://github.com/koalaman/shellcheck

genv-help: command not found

Installed genv in a jupyterhub/singleuser:3.1.0 Docker container with conda install -y -c conda-forge genv.

Running genv, genv config, or genv activate returns:

genv-help: command not found

Incompatible with Anaconda

Genv does not seem to be fully compatible with conda. If you run genv activate and then activate your conda environment, your genv reservation dies. Similarly, if you first activate your conda environment and then install genv, you cannot use genv activate at all, as it rejects the command. It's kind of messy.

New GPU Addition Not Showing

I just added a fourth GPU to my desktop. genv devices does not show the new index, and I can't attach to it with --index 3.

Tried pip uninstall and pip install (latest release), but that didn't help. Any suggestions would be very appreciated.

(base) anindya@SGPUW2:~$ nvidia-smi
Tue Mar 19 15:33:50 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14              Driver Version: 550.54.14      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4060 Ti     Off |   00000000:04:00.0 Off |                  N/A |
|  0%   34C    P8              2W /  165W |      19MiB /  16380MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 4080        Off |   00000000:17:00.0 Off |                  N/A |
|  0%   42C    P8             19W /  320W |       9MiB /  16376MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA GeForce RTX 4060 Ti     Off |   00000000:65:00.0 Off |                  N/A |
|  0%   35C    P8              3W /  165W |       8MiB /  16380MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  Quadro RTX 8000                Off |   00000000:B3:00.0 Off |                  Off |
| 33%   31C    P8              6W /  260W |       5MiB /  49152MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      2737      G   /usr/lib/xorg/Xorg                             14MiB |
|    1   N/A  N/A      2737      G   /usr/lib/xorg/Xorg                              4MiB |
|    2   N/A  N/A      2737      G   /usr/lib/xorg/Xorg                              4MiB |
|    3   N/A  N/A      2737      G   /usr/lib/xorg/Xorg                              4MiB |
+-----------------------------------------------------------------------------------------+
(base) anindya@SGPUW2:~$ genv devices
ID      ENV ID      ENV NAME        ATTACHED
0
1
2
(base) anindya@SGPUW2:~$

json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Upon executing genv activate I get:

Traceback (most recent call last):
  File "/mnt/mass_disk1/e.abc/miniconda3/bin/genv", line 10, in <module>
    sys.exit(main())
  File "/mnt/mass_disk1/e.abc/miniconda3/lib/python3.9/site-packages/genv/cli/__main__.py", line 116, in main
    activate.run(args.shell, args)
  File "/mnt/mass_disk1/e.abc/miniconda3/lib/python3.9/site-packages/genv/cli/activate.py", line 67, in run
    genv.core.envs.activate(
  File "/mnt/mass_disk1/e.abc/miniconda3/lib/python3.9/site-packages/genv/core/envs.py", line 88, in activate
    with State() as envs:
  File "/mnt/mass_disk1/e.abc/miniconda3/lib/python3.9/site-packages/genv/core/utils.py", line 56, in __enter__
    return self.load()
  File "/mnt/mass_disk1/e.abc/miniconda3/lib/python3.9/site-packages/genv/core/utils.py", line 35, in load
    self._state = genv.utils.load_state(
  File "/mnt/mass_disk1/e.abc/miniconda3/lib/python3.9/site-packages/genv/utils/utils.py", line 45, in load_state
    o = json.load(f, cls=json_decoder)
  File "/mnt/mass_disk1/e.abc/miniconda3/lib/python3.9/json/__init__.py", line 293, in load
    return loads(fp.read(),
  File "/mnt/mass_disk1/e.abc/miniconda3/lib/python3.9/json/__init__.py", line 359, in loads
    return cls(**kw).decode(s)
  File "/mnt/mass_disk1/e.abc/miniconda3/lib/python3.9/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/mnt/mass_disk1/e.abc/miniconda3/lib/python3.9/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

No idea where it stems from, but it looks like a genv and/or environment issue, hence this report. I have genv installed via conda; its version is 1.2.0.
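For what it's worth, this exact message is what Python's json module produces when handed empty input, so a likely culprit is an empty or truncated genv state file (a guess from the traceback, not a confirmed diagnosis):

```python
import json

# json.loads("") fails with the very message from the traceback above,
# which suggests genv's load_state() read a state file with no content.
try:
    json.loads("")
except json.JSONDecodeError as e:
    print(e)  # Expecting value: line 1 column 1 (char 0)
```

If that is the cause, deleting or re-initializing the corrupt state file would likely let genv regenerate it.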

Configure CUDA and other library versions

Hey, first of all, really cool idea for a tool.

One of the most annoying aspects of sharing a workstation with others is managing different needs for library versions (cuDNN, CUDA Toolkit, etc.). While PyTorch handles this by bundling the required libraries into its Python packages, TensorFlow still requires users to install the necessary library versions themselves.

Would you consider supporting the management of different library/driver versions per environment through this tool?

Multi-machine support

As part of an AI-focused institute, we find genv a great tool for handling single machines. However, we have a fleet composed of many machines. Are there any plans on your side to support or develop features toward handling GPU availability in a cluster, or simply across multiple machines?

Feature request: support for per-user enforcement

Hello, I'm in an academic AI lab that uses genv to manage access to our compute resources. The tool has been great so far, but one thing that would be really convenient for us is the ability to set different quotas/constraints for different users, e.g. a lower max_devices for certain students (we want PhD students to have access to more GPUs than undergrads, for example). A simple config file for every user that defines their quotas would suffice.

This kind of feature would be greatly appreciated. Thanks!
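To make the request concrete, here is a sketch of what such a per-user quota file could look like. This is not an existing genv feature; the file name, keys, and layout are invented purely for illustration:

```yaml
# quotas.yml (hypothetical; genv does not currently read such a file)
users:
  phd-student:
    max-devices: 4      # may attach up to 4 GPUs at once
    gpu-memory: 16g     # per-environment memory cap
  undergrad:
    max-devices: 1
    gpu-memory: 8g
```

An enforcement daemon could then consult a file like this instead of a single global --max-devices-per-user value.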

genv.ray.remote does not work on classes

@genv.ray.remote does not work on classes but only on methods.

When it is used before the functions, it works:

@genv.ray.remote(num_gpus=1)
def train():
    ...

But when it is used with classes;

@genv.ray.remote(num_gpus=1)
class Trainer:
    def train(self):
        ...

it throws the following error:

2023-07-21 08:44:38,183	INFO worker.py:1636 -- Started a local Ray instance.
Traceback (most recent call last):
  File "/home/ekinkarabulut/genv/ray_genv.py", line 105, in <module>
    main_task = trainer.train.remote()
                ^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'ray._raylet.ObjectRef' object has no attribute 'train'
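The error is consistent with the wrapper treating every target like a plain function: wrapping a class in a function-style remote makes Trainer.remote() submit a task whose result is an ObjectRef, and an ObjectRef has no train attribute. A class-aware decorator would need to branch on inspect.isclass so classes become Ray actors. The sketch below is a toy stand-in to illustrate that branch, not genv's actual code:

```python
import inspect

def remote(num_gpus=1):
    """Toy stand-in for a genv.ray.remote-style decorator (hypothetical)."""
    def wrap(target):
        if inspect.isclass(target):
            # A real fix would forward classes to ray.remote so they become
            # actors; treating them like plain functions yields an ObjectRef,
            # matching the AttributeError reported above.
            raise TypeError("classes need actor-style handling")
        def task(*args, **kwargs):
            # A real implementation would attach GPUs here before running.
            return target(*args, **kwargs)
        return task
    return wrap

@remote(num_gpus=1)
def train():
    return "trained"

print(train())  # trained
```

Until the integration supports classes, a workaround is to keep the decorated unit a function and instantiate the class inside it.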
