run-ai / docs
markdown docs
Hello.
In general, if we want to share files, we use the following command:

```
docker run -i -t -v /home/dir:/container/dir ubuntu /bin/bash
```

But how can we achieve the same with the `runai submit` command?
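For reference, recent versions of the Run:ai CLI document a `-v`/`--volume` flag on `runai submit` that takes the same `source:target` form as Docker; assuming your CLI version supports it (worth verifying with `runai submit --help`), the equivalent would be roughly:

```
# Hedged sketch: flag availability depends on your runai CLI version.
runai submit my-job -i ubuntu --interactive \
  -v /home/dir:/container/dir
```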
Originally posted by giscus[bot] July 12, 2023
Installing additional Clusters | Installation:
https://docs.run.ai/v2.13/admin/runai-setup/self-hosted/k8s/additional-clusters/
Hello. When trying to authenticate to Run:ai, the input form has this in the help message next to the field labeled "Cluster URL":

> The Run:ai user interface requires a URL or IP address of the Kubernetes cluster (e.g. https://143.23.55.2 or https://cluster.myorg.com)

However, when I use the IP of the ingress controller (which brings us to the next problem), I receive the following error:

```
runai system is not yet available due to: enabled operands handling error: Ingress.extensions "researcher-service-ingress" is invalid: spec.rules[0].host: Invalid value: "10.150.98.170": must be a DNS name, not an IP address
```
Not only is this help message contradictory to the implementation, there is also no way to tell what you meant by "Cluster URL". Please rename the field so it describes what you actually want it to be, and change the wording of the help tooltip to match. There is no need to include examples in the help message: users who make it as far as that message have already opened a web browser and seen examples of URLs.
Following the doc here, I get the following errors on a fresh install of Ubuntu 22.10:
```
...
Err:6 https://packages.cloud.google.com/apt kubernetes-xenial InRelease
  The following signatures couldn't be verified because the public key is not available: NO_PUBKEY B53DC80D13EDEF05
Reading package lists... Done
W: GPG error: https://packages.cloud.google.com/apt kubernetes-xenial InRelease: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY B53DC80D13EDEF05
E: The repository 'https://apt.kubernetes.io kubernetes-xenial InRelease' is not signed.
N: Updating from such a repository can't be done securely, and is therefore disabled by default.
N: See apt-secure(8) manpage for repository creation and user configuration details.
installing kubectl kubeadm kubelet...
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
No apt package "kubeadm", but there is a snap with that name.
Try "snap install kubeadm"
No apt package "kubectl", but there is a snap with that name.
Try "snap install kubectl"
No apt package "kubelet", but there is a snap with that name.
Try "snap install kubelet"
E: Unable to locate package kubelet
E: Unable to locate package kubeadm
E: Unable to locate package kubectl
```
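The NO_PUBKEY error means the repository's signing key is missing from apt's keyring; one common fix (the keyring path below follows the Kubernetes docs of that era and is an assumption, adjust for your distribution) is:

```
# Fetch the repository signing key, register it, and retry:
curl -fsSL https://packages.cloud.google.com/apt/doc/apt-key.gpg \
  | sudo gpg --dearmor -o /usr/share/keyrings/kubernetes-archive-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/kubernetes-archive-keyring.gpg] https://apt.kubernetes.io/ kubernetes-xenial main" \
  | sudo tee /etc/apt/sources.list.d/kubernetes.list
sudo apt-get update
```

Note also that the legacy `apt.kubernetes.io` repository has since been deprecated in favor of `pkgs.k8s.io`, which may itself explain failures on newer installs.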
This page: https://docs.run.ai/admin/researcher-setup/cli-install/#install-runai-cli reads:

> Researcher Command Line Interface
> Mac or Linux

However, when pressing on "Researcher Command Line Interface", the user is simply redirected to the documentation page again.
Other users reported that when using Google Chrome or Brave they are actually able to reach the download form, so it may be a compatibility issue. The users who do reach the download page also show a screenshot in which the drop-down looks different (the "Researcher Command Line Interface" item doesn't have an "arrow poking out of the box" icon next to it).
The following is described under the node roles doc:

## Dedicated GPU & CPU Nodes

Separate nodes into those that:

* Run GPU workloads
* Run CPU workloads
* Do not run Run:ai at all. These jobs will not be monitored using the Run:ai Administration User interface.

This is actually not true: all nodes in the cluster are displayed under the Nodes tab in the Administration UI.
That includes Run:ai worker nodes, Run:ai system nodes, regular workers, and cluster masters.
All nodes that contain GPUs and have DCGM exporting metrics on them count as "GPU nodes" in the Overview dashboard.
That includes nodes that don't have the runai-container-toolkit and runai-container-toolkit-exporter DaemonSets running on them: no Run:ai pod will be scheduled on such nodes, but they are still counted.
Review node names using `kubectl get nodes`. For each such node run:

```
runai-adm set node-role --gpu-worker <node-name>
```

or

```
runai-adm set node-role --cpu-worker <node-name>
```
Nodes not marked as GPU worker or CPU worker will not run Run:ai at all.
That is also not true: nodes marked as neither GPU worker nor CPU worker will still run any kind of Run:ai workload.
The same behavior is observed when both roles are assigned to a node.
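To check what a given node actually ended up with, you can inspect its labels; the exact label key that `runai-adm set node-role` applies is not stated in the docs, so the grep below is a guess to be adjusted:

```
# List node labels; the role label applied by runai-adm should
# appear among them (the "runai" substring is an assumption).
kubectl get nodes --show-labels | grep -i runai
```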
In the Run:ai installation tar (for air-gapped environments), you can find the following:

```
deploy/
|__...
|__runai-backend/
|__runai-backend-<version>.tgz
|__...
```

The `runai-backend` directory is empty, while `runai-backend-<version>.tgz` should be inside it according to the docs:

```
helm upgrade runai-backend runai-backend/runai-backend-<version>.tgz \
  -n runai-backend -f runai-backend-values.yaml
```

Therefore, either the tarball should be moved into the directory, or the command in the docs should be changed to use `./runai-backend-<version>.tgz` instead.
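A sketch of the first workaround, assuming you are inside `deploy/` and the tarball sits next to the empty directory as in the listing above:

```
# Move the chart where the documented helm command expects it:
mv runai-backend-<version>.tgz runai-backend/
```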
There is a typo in docs/docs/Researcher/cli-reference/runai-submit-mpi.md: in the Examples section, the flag `--num-processes=2` should be `--processes=2`.
Opening an issue here because there is no way to open one in the RunAI CLI repository:
https://github.com/run-ai/runai-cli/releases
In your latest version, 2.3.1, the install script copies `charts` into the installation folder; however, that folder is not present in the install files. For now I have copied over the `charts` folder from version 2.3.0, and it seems to work.
Additionally, you are missing an uninstall option to remove the command-line interface in a user-friendly manner.
The Audit Log page refers to a broken link:

> To retrieve the Audit log you need to call an API. You can do this via code or by using the Audit function via a [user interface for calling APIs](https://yaron.runailabs.net/api/docs/#/Audit/get_v1_k8s_audit){target=_blank}.

The link redirects to a site that cannot be reached.
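The operation name in the broken link (`get_v1_k8s_audit`) suggests the log can be pulled directly with a GET to `/v1/k8s/audit`; a hedged sketch, where the control-plane base URL and the bearer-token handling are assumptions:

```
# Hypothetical: substitute your control-plane URL and API token.
curl -H "Authorization: Bearer <token>" \
  "https://<control-plane-url>/v1/k8s/audit"
```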
Hello.
I'm trying to understand whether this is a configuration error on my part, or is this a problem with Run:ai.
I'm getting this warning in the following contexts:
```
root@ose-t-u2004-02-28-1:~# ./runai-cli-linux-amd64 list jobs -p test1
WARN[0000] Error in getting user details: token is missing the 'username' claim
ERRO[0000] project test1 does not exist. Run 'runai list project' to view all available projects
root@ose-t-u2004-02-28-1:~# ./runai-cli-linux-amd64 list jobs -A
WARN[0000] Error in getting user details: token is missing the 'username' claim
WARN[0000] Error in getting user details: token is missing the 'username' claim
NAME  STATUS  AGE  NODE  IMAGE  TYPE  PROJECT  USER  GPUs Allocated (Requested)  PODs Running (Pending)  SERVICE URL(S)
```
As you can see, sometimes this results in an error, and other times it does not.
Additionally, the jobs are actually running, since:
```
root@ose-t-u2004-02-28-1:~# ./runai-cli-linux-amd64 describe job train1 -p test1 | head -5
WARN[0000] Error in getting user details: token is missing the 'username' claim
Name: train1
Namespace: runai-test1
Type: Train
Status: Running
Duration: 9m
```
succeeds, but with the same warning.
Also, when trying to list projects, none of them are found:
```
root@ose-t-u2004-02-28-1:~# ./runai-cli-linux-amd64 list project
WARN[0000] Error in getting user details: token is missing the 'username' claim
PROJECT  DEPARTMENT  DESERVED GPUs  ALLOCATED GPUs  INT LIMIT  INT AFFINITY  TRAIN AFFINITY  MANAGED NAMESPACE
root@ose-t-u2004-02-28-1:~#
```
Even though I was able to submit the job and execute `describe` on it.
So, what is going on?
NB: it would be nice if, instead of being directed at developers, the warning message conveyed useful information to users (administrators are users for this purpose). As it stands, a user has no way to tell which token the message is talking about.
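As a diagnostic step, the claims a token actually carries can be inspected directly: a JWT's payload segment is just base64url-encoded JSON. This is a generic sketch, not a Run:ai tool, and it does not verify the signature:

```python
import base64
import json

def jwt_claims(token: str) -> dict:
    """Decode the payload segment of a JWT (no signature check)."""
    payload = token.split(".")[1]
    payload += "=" * (-len(payload) % 4)  # restore stripped base64 padding
    return json.loads(base64.urlsafe_b64decode(payload))

# Toy token whose payload lacks a 'username' claim:
header = base64.urlsafe_b64encode(b'{"alg":"none"}').decode().rstrip("=")
body = base64.urlsafe_b64encode(b'{"sub":"1234"}').decode().rstrip("=")
claims = jwt_claims(f"{header}.{body}.")
print("username" in claims)  # → False
```

Running this on the token the CLI uses would confirm whether the `username` claim is really absent, pointing at the identity-provider configuration rather than the CLI itself.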
Starting with OCP 4.9.9, entitlements are not required anymore.
Hi, I was trying to deploy an inference workload using the UI as well as YAML. Here is the YAML; the example was taken from the quickstart guide in the Run:ai 2.15 docs.
```yaml
apiVersion: run.ai/v2alpha1
kind: InferenceWorkload
metadata:
  name: inference1
  namespace: runai-demo
spec:
  name:
    value: inference1
  gpu:
    value: "0.5"
  image:
    value: "gcr.io/run-ai-demo/example-triton-server"
  minScale:
    value: 1
  maxScale:
    value: 2
  metric:
    value: concurrency
  target:
    value: 80
  ports:
    items:
      port1:
        value:
          container: 8000
          protocol: http
```
I get the following error:

```
Error from server (validation failed: must not set the field(s): spec.template.spec.schedulerName, spec.template.spec.securityContext): error when creating "inferenceworkload.yaml": admission webhook "workload-controller.runai.svc" denied the request: validation failed: must not set the field(s): spec.template.spec.schedulerName, spec.template.spec.securityContex
```

This was done on OpenShift 4.13.
The docs are inconsistent. There is supposed to be an On-prem tab just like there is for 2.9. Not sure why you would get rid of an on-prem tab when it's the majority of deployment models.
On Mon, Jul 24, 2023 at 11:13 PM Yaron @.***> wrote:
It says "Follow the Getting Started guide to install the NVIDIA GPU
Operator, or see the distribution-specific instructions below...." So there
are a number of supported environments (including native k8s which you
mention) that fall under this catch-all phrase...
Since you are the native English speaker here, you are welcome to re-phrase it and send a pull request. I will change to 16GB @kirson-git https://github.com/kirson-git
--
Regards,
Michael Burrows
Solution Architect
Customer Success
Originally posted by @runneramb in #411 (comment)
This `VISIBLE=true` does nothing useful; perhaps remove it to avoid confusion?
docs/quickstart/x-forwarding/docker/Dockerfile
Lines 15 to 16 in d17df21
When installing Run:ai, a namespace named `runai-reservation` is created in the cluster.
This namespace reserves GPUs that are used by jobs with fractional GPUs.
When a new job with a fractional GPU is submitted, a new pod is created within the `runai-reservation` namespace; it is responsible for preventing "full GPU" workloads from using that GPU.
There is no reference for that namespace at all in the docs.
An official elaboration would be great :)
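In the meantime, the behavior described above can be observed directly on a cluster, assuming a fractional-GPU job is currently running:

```
# Reservation pods should appear here while fractional-GPU jobs run:
kubectl get pods -n runai-reservation -o wide
```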