run-ai / docs
markdown docs
Hello.
In general, if we want to share files, we use the following command:

```
docker run -i -t -v /home/dir:/container/dir ubuntu /bin/bash
```

But how can we achieve the same with the `runai submit` command?
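For reference, recent versions of the Run:ai CLI document a `-v`/`--volume` flag on `runai submit` that takes the same `source:target` form as Docker; assuming your CLI version supports it (worth verifying with `runai submit --help`), the equivalent would be roughly:

```
# Hedged sketch: flag availability depends on your runai CLI version.
runai submit my-job -i ubuntu --interactive \
  -v /home/dir:/container/dir
```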
Originally posted by giscus[bot] July 12, 2023
Installing additional Clusters | Installation:
https://docs.run.ai/v2.13/admin/runai-setup/self-hosted/k8s/additional-clusters/
Hello. When trying to authenticate to Run:ai, the input form has this in the help message next to the field labeled "Cluster URL":

> The Run:ai user interface requires a URL or IP address of the Kubernetes cluster (e.g. https://143.23.55.2 or https://cluster.myorg.com)

However, when I use the IP of the ingress controller (which brings us to the next problem), I receive the following error:

```
runai system is not yet available due to: enabled operands handling error: Ingress.extensions "researcher-service-ingress" is invalid: spec.rules[0].host: Invalid value: "10.150.98.170": must be a DNS name, not an IP address
```
Not only is this help message contradictory to the implementation, there is also no way to tell what you meant by "Cluster URL". Please rename the field so it describes what you actually want it to be, and change the wording of the help tooltip to match. There is no need to include examples in the help message: users who make it as far as that message have already opened a web browser and seen examples of URLs.
Following the doc here, I get the following errors on a fresh install of Ubuntu 22.10:
```
...
Err:6 https://packages.cloud.google.com/apt kubernetes-xenial InRelease
  The following signatures couldn't be verified because the public key is not available: NO_PUBKEY B53DC80D13EDEF05
Reading package lists... Done
W: GPG error: https://packages.cloud.google.com/apt kubernetes-xenial InRelease: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY B53DC80D13EDEF05
E: The repository 'https://apt.kubernetes.io kubernetes-xenial InRelease' is not signed.
N: Updating from such a repository can't be done securely, and is therefore disabled by default.
N: See apt-secure(8) manpage for repository creation and user configuration details.
installing kubectl kubeadm kubelet...
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
No apt package "kubeadm", but there is a snap with that name.
Try "snap install kubeadm"
No apt package "kubectl", but there is a snap with that name.
Try "snap install kubectl"
No apt package "kubelet", but there is a snap with that name.
Try "snap install kubelet"
E: Unable to locate package kubelet
E: Unable to locate package kubeadm
E: Unable to locate package kubectl
```
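The NO_PUBKEY error means the repository's signing key is missing from apt's keyring; one common fix (the keyring path below follows the Kubernetes docs of that era and is an assumption, adjust for your distribution) is:

```
# Fetch the repository signing key, register it, and retry:
curl -fsSL https://packages.cloud.google.com/apt/doc/apt-key.gpg \
  | sudo gpg --dearmor -o /usr/share/keyrings/kubernetes-archive-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/kubernetes-archive-keyring.gpg] https://apt.kubernetes.io/ kubernetes-xenial main" \
  | sudo tee /etc/apt/sources.list.d/kubernetes.list
sudo apt-get update
```

Note also that the legacy `apt.kubernetes.io` repository has since been deprecated in favor of `pkgs.k8s.io`, which may itself explain failures on newer installs.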
This page: https://docs.run.ai/admin/researcher-setup/cli-install/#install-runai-cli reads:

> Researcher Command Line Interface
> Mac or Linux

However, when pressing on "Researcher Command Line Interface", the user is simply redirected to the documentation page again.
Other users reported that when using Google Chrome or Brave they are actually able to reach the download form, so it may be a compatibility issue. The users who do reach the download page also show a screenshot in which the drop-down looks different (the "Researcher Command Line Interface" item doesn't have an "arrow poking out of the box" icon next to it).
The following is described under the node roles doc:

## Dedicated GPU & CPU Nodes

Separate nodes into those that:

* Run GPU workloads
* Run CPU workloads
* Do not run Run:ai at all. These jobs will not be monitored using the Run:ai Administration User interface.

This is actually not true: all nodes in the cluster are displayed under the Nodes tab in the Administration UI.
That includes Run:ai worker nodes, Run:ai system nodes, regular workers, and cluster masters.
All nodes that contain GPUs and have DCGM exporting metrics on them count as "GPU nodes" in the Overview dashboard.
That includes nodes that don't have the runai-container-toolkit and runai-container-toolkit-exporter DaemonSets running on them: no Run:ai pod will be scheduled on such nodes, but they are still counted.
Review node names using `kubectl get nodes`. For each such node run:

```
runai-adm set node-role --gpu-worker <node-name>
```

or

```
runai-adm set node-role --cpu-worker <node-name>
```
Nodes not marked as GPU worker or CPU worker will not run Run:ai at all.
That is also not true: nodes marked as neither GPU worker nor CPU worker will still run any kind of Run:ai workload.
The same behavior is observed when both roles are assigned to a node.
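To check what a given node actually ended up with, you can inspect its labels; the exact label key that `runai-adm set node-role` applies is not stated in the docs, so the grep below is a guess to be adjusted:

```
# List node labels; the role label applied by runai-adm should
# appear among them (the "runai" substring is an assumption).
kubectl get nodes --show-labels | grep -i runai
```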
In the Run:ai installation tar (for air-gapped environments), you can find the following:

```
deploy/
|__...
|__runai-backend/
|__runai-backend-<version>.tgz
|__...
```

The `runai-backend` directory is empty, while `runai-backend-<version>.tgz` should be inside it according to the docs:

```
helm upgrade runai-backend runai-backend/runai-backend-<version>.tgz \
  -n runai-backend -f runai-backend-values.yaml
```

Therefore, either the tarball should be moved into the directory, or the command in the docs should be changed to use `./runai-backend-<version>.tgz` instead.
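A sketch of the first workaround, assuming you are inside `deploy/` and the tarball sits next to the empty directory as in the listing above:

```
# Move the chart where the documented helm command expects it:
mv runai-backend-<version>.tgz runai-backend/
```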
There is a typo in docs/docs/Researcher/cli-reference/runai-submit-mpi.md: in the Examples section, the flag `--num-processes=2` should be `--processes=2`.
Opening an issue here because there is no way to open one in the RunAI CLI repository:
https://github.com/run-ai/runai-cli/releases
In your latest version, 2.3.1, the install script copies `charts` into the installation folder; however, that folder is not present in the install files. For now I have copied over the `charts` folder from version 2.3.0, and it seems to work.
Additionally, you are missing an uninstall option to remove the command-line interface in a user-friendly manner.
The Audit Log page refers to a broken link:

> To retrieve the Audit log you need to call an API. You can do this via code or by using the Audit function via a [user interface for calling APIs](https://yaron.runailabs.net/api/docs/#/Audit/get_v1_k8s_audit){target=_blank}.

The link redirects to a site that cannot be reached.
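The operation name in the broken link (`get_v1_k8s_audit`) suggests the log can be pulled directly with a GET to `/v1/k8s/audit`; a hedged sketch, where the control-plane base URL and the bearer-token handling are assumptions:

```
# Hypothetical: substitute your control-plane URL and API token.
curl -H "Authorization: Bearer <token>" \
  "https://<control-plane-url>/v1/k8s/audit"
```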
Hello.
I'm trying to understand whether this is a configuration error on my part, or is this a problem with Run:ai.
I'm getting this warning in the following contexts:
```
root@ose-t-u2004-02-28-1:~# ./runai-cli-linux-amd64 list jobs -p test1
WARN[0000] Error in getting user details: token is missing the 'username' claim
ERRO[0000] project test1 does not exist. Run 'runai list project' to view all available projects
root@ose-t-u2004-02-28-1:~# ./runai-cli-linux-amd64 list jobs -A
WARN[0000] Error in getting user details: token is missing the 'username' claim
WARN[0000] Error in getting user details: token is missing the 'username' claim
NAME  STATUS  AGE  NODE  IMAGE  TYPE  PROJECT  USER  GPUs Allocated (Requested)  PODs Running (Pending)  SERVICE URL(S)
```
As you can see, sometimes this results in an error, and other times it does not.
Additionally, the jobs are actually running, since:
```
root@ose-t-u2004-02-28-1:~# ./runai-cli-linux-amd64 describe job train1 -p test1 | head -5
WARN[0000] Error in getting user details: token is missing the 'username' claim
Name: train1
Namespace: runai-test1
Type: Train
Status: Running
Duration: 9m
```
succeeds, but with the same warning.
Also, when trying to list projects, none of them are found:
```
root@ose-t-u2004-02-28-1:~# ./runai-cli-linux-amd64 list project
WARN[0000] Error in getting user details: token is missing the 'username' claim
PROJECT  DEPARTMENT  DESERVED GPUs  ALLOCATED GPUs  INT LIMIT  INT AFFINITY  TRAIN AFFINITY  MANAGED NAMESPACE
root@ose-t-u2004-02-28-1:~#
```
Even though I was able to submit the job and execute `describe` on it.
So, what is going on?
NB: it would be nice if, instead of being directed at developers, the warning message conveyed useful information to users (administrators are users for this purpose). As it stands, a user has no way to tell which token the message is talking about.
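As a diagnostic step, the claims a token actually carries can be inspected directly: a JWT's payload segment is just base64url-encoded JSON. This is a generic sketch, not a Run:ai tool, and it does not verify the signature:

```python
import base64
import json

def jwt_claims(token: str) -> dict:
    """Decode the payload segment of a JWT (no signature check)."""
    payload = token.split(".")[1]
    payload += "=" * (-len(payload) % 4)  # restore stripped base64 padding
    return json.loads(base64.urlsafe_b64decode(payload))

# Toy token whose payload lacks a 'username' claim:
header = base64.urlsafe_b64encode(b'{"alg":"none"}').decode().rstrip("=")
body = base64.urlsafe_b64encode(b'{"sub":"1234"}').decode().rstrip("=")
claims = jwt_claims(f"{header}.{body}.")
print("username" in claims)  # → False
```

Running this on the token the CLI uses would confirm whether the `username` claim is really absent, pointing at the identity-provider configuration rather than the CLI itself.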
Starting with OCP 4.9.9, entitlements are not required anymore.
Hi, I was trying to deploy an inference workload using the UI as well as YAML. Here is the YAML; the example was taken from the quickstart guide in the Run:ai 2.15 docs.
```yaml
apiVersion: run.ai/v2alpha1
kind: InferenceWorkload
metadata:
  name: inference1
  namespace: runai-demo
spec:
  name:
    value: inference1
  gpu:
    value: "0.5"
  image:
    value: "gcr.io/run-ai-demo/example-triton-server"
  minScale:
    value: 1
  maxScale:
    value: 2
  metric:
    value: concurrency
  target:
    value: 80
  ports:
    items:
      port1:
        value:
          container: 8000
          protocol: http
```
I get the following error:

```
Error from server (validation failed: must not set the field(s): spec.template.spec.schedulerName, spec.template.spec.securityContext): error when creating "inferenceworkload.yaml": admission webhook "workload-controller.runai.svc" denied the request: validation failed: must not set the field(s): spec.template.spec.schedulerName, spec.template.spec.securityContex
```

This was done on OpenShift 4.13.
The docs are inconsistent. There is supposed to be an On-prem tab just like there is for 2.9. Not sure why you would get rid of an on-prem tab when it's the majority of deployment models.
On Mon, Jul 24, 2023 at 11:13 PM Yaron @.***> wrote:
It says "Follow the Getting Started guide to install the NVIDIA GPU
Operator, or see the distribution-specific instructions below...." So there
are a number of supported environments (including native k8s which you
mention) that fall under this catch-all phrase...
Since you are the native English speaker here, you are welcome to re-phrase it and send a pull request. I will change to 16GB @kirson-git https://github.com/kirson-git
--
Regards,
Michael Burrows
Solution Architect
Customer Success
Originally posted by @runneramb in #411 (comment)
This `VISIBLE=true` does nothing useful; perhaps remove it to avoid confusion?
docs/quickstart/x-forwarding/docker/Dockerfile
Lines 15 to 16 in d17df21
When installing Run:ai, a namespace named `runai-reservation` is created in the cluster.
This namespace reserves GPUs that are used by jobs with fractional GPUs.
When a new job with a fractional GPU is submitted, a new pod is created within the `runai-reservation` namespace; it is responsible for preventing "full GPU" workloads from using that GPU.
There is no reference for that namespace at all in the docs.
An official elaboration would be great :)
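In the meantime, the behavior described above can be observed directly on a cluster, assuming a fractional-GPU job is currently running:

```
# Reservation pods should appear here while fractional-GPU jobs run:
kubectl get pods -n runai-reservation -o wide
```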