dstackai / dstack
dstack is an easy-to-use and flexible container orchestrator for running AI workloads in any cloud or data center.
Home Page: https://dstack.ai
License: Mozilla Public License 2.0
Currently, the user doesn't see why a run is failing...
Examples:
etc
In case of stopping, send SIGINT and wait until the job finishes.
In case of aborting, immediately kill the job.
Also, make sure we don't wait extra time when cleaning up job resources.
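The stop vs. abort semantics above could be sketched roughly like this (a hypothetical helper, not dstack's actual runner code), assuming the job is a local subprocess:

```python
import signal
import subprocess

def stop_job(proc: subprocess.Popen, timeout: float = 30.0) -> int:
    """Stop: send SIGINT and wait until the job finishes on its own."""
    proc.send_signal(signal.SIGINT)
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()  # fall back to a hard kill only if the job hangs
        return proc.wait()

def abort_job(proc: subprocess.Popen) -> int:
    """Abort: kill the job immediately, no grace period."""
    proc.kill()
    return proc.wait()
```

The timeout in `stop_job` is an assumption; the issue only says "wait until the job finishes", so the fallback kill may or may not be desired.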
It would be great to have an option to either specify a launch time for the run or simply say "start this run in 2 hours".
dstack artifacts download doesn't tell you anything if there is a typo in the run name. It would be nice to show a warning.
This regularly happens if you work with ipynb notebooks locally and then go to submit a Python file, regardless of whether the latter was changed or not.
The submission sometimes fails at the dstack run stage
with requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url: https://api.dstack.ai/runs/submit
What is worse, sometimes the submission reaches the server successfully but fails there without any notification.
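One way the CLI could be made more resilient to transient 500s is a retry with exponential backoff around the submit request. A minimal sketch (the `post` callable stands in for the actual HTTP call; the function name and signature are assumptions, not dstack's API):

```python
import time

def submit_with_retry(post, retries: int = 3, base_delay: float = 1.0):
    """Call `post()` (which returns (status_code, body)); retry 5xx
    responses with exponential backoff instead of failing on the
    first 500 Server Error."""
    status, body = None, None
    for attempt in range(retries):
        status, body = post()
        if status < 500:
            return status, body
        if attempt < retries - 1:
            time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    raise RuntimeError(f"{status} Server Error after {retries} attempts")
```

This only papers over the client side; the silent server-side failures mentioned above would still need a notification mechanism.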
Currently, in order to use dstack, the user needs to either have an existing cloud account or own hardware.
It would be great if dstack provided its own compute provider and allowed users to use dstack without having their own cloud account or hardware.
On one hand, dstack could provide a number of free GPU hours for a trial.
On the other hand, dstack could provide a way to pay for the hours spent, e.g. via a card.
Questions:
When using dstack artifacts upload, I provided the tag I wanted to assign to the data. Unfortunately, further runs depending on this data were failing without any logs. I then removed the tag from the data (after it was successfully uploaded) and assigned the same tag once again. Without any changes in the local repo, the runs came back to life and launched easily on dstack.
This relates to the new UI, where we hide jobs.
If a run or a job is restarted on the same runner, the runner tries to apply the Git patch (repo diff) and fails with a conflict, because it is applying the patch to a folder where the patch has already been applied.
Steps to reproduce:
Expected:
Actual:
Log:
ERRO[2022-05-25T11:21:58Z] diff applier error ae=ApplyError{Fragment: 1, FragmentLine: 3, Line: 3} run_name=odd-rabbit-1 job_id=e7fa162e70b1 workflow=train-mnist filename=.dstack/variables.yaml err=conflict: fragment line does not match src line
ERRO[2022-05-25T11:21:58Z] run job is finished with error job_id=e7fa162e70b1 err=conflict: fragment line does not match src line workflow=train-mnist run_name=odd-rabbit-1
INFO[2022-05-25T11:24:57Z] New job submitted job_id=e7fa162e70b1 workflow=train-mnist run_name=odd-rabbit-1
WARN[2022-05-25T11:24:57Z] count of log arguments must be odd job_id=e7fa162e70b1 workflow=train-mnist run_name=odd-rabbit-1 count=1
INFO[2022-05-25T11:24:58Z] git checkout path=/root/.dstack/tmp/runs/odd-rabbit-1/e7fa162e70b1 workflow=train-mnist run_name=odd-rabbit-1 url=https://github.com/dstackai/dstack-examples.git branch=main hash=f219066b2379c69263f281f65167c8f6046874a2 job_id=e7fa162e70b1 auth=*http.BasicAuth
WARN[2022-05-25T11:24:58Z] git clone ref==nil branch=main hash=f219066b2379c69263f281f65167c8f6046874a2 job_id=e7fa162e70b1 path=/root/.dstack/tmp/runs/odd-rabbit-1/e7fa162e70b1 workflow=train-mnist run_name=odd-rabbit-1 url=https://github.com/dstackai/dstack-examples.git
INFO[2022-05-25T11:24:58Z] apply diff start run_name=odd-rabbit-1 dir=/root/.dstack/tmp/runs/odd-rabbit-1/e7fa162e70b1 job_id=e7fa162e70b1 workflow=train-mnist
ERRO[2022-05-25T11:24:58Z] diff applier error job_id=e7fa162e70b1 workflow=train-mnist run_name=odd-rabbit-1 filename=.dstack/variables.yaml err=conflict: fragment line does not match src line ae=ApplyError{Fragment: 1, FragmentLine: 3, Line: 3}
ERRO[2022-05-25T11:24:58Z] run job is finished with error run_name=odd-rabbit-1 job_id=e7fa162e70b1 err=conflict: fragment line does not match src line workflow=train-mnist
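One possible fix for the conflict above is to reset the worktree to the pristine checked-out commit before (re)applying the diff, so a restarted job never applies the patch on top of itself. A hedged sketch using the git CLI (function name and exact commands are assumptions, not the runner's actual implementation):

```python
import subprocess

def apply_repo_diff(repo_dir: str, commit: str, diff_path: str) -> None:
    """Re-applying a patch to an already-patched worktree conflicts,
    so first reset to the pristine commit, then apply the diff."""
    git = ["git", "-C", repo_dir]
    subprocess.run([*git, "checkout", "--force", commit], check=True)
    subprocess.run([*git, "clean", "-fd"], check=True)  # drop untracked leftovers
    subprocess.run([*git, "apply", diff_path], check=True)
```

With this, applying the same diff twice is idempotent, which matches the restart-on-same-runner scenario in the log above.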
Now, every job may have its own environment variables set by the provider – see the environment property in the job. It's a map of string to string. The runner should pass these environment variables to the job container.
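Assuming the runner launches the container via the docker CLI, passing the map through could look like this sketch (function names are hypothetical; the real runner may use a Docker API client instead):

```python
import subprocess

def build_env_flags(environment: dict[str, str]) -> list[str]:
    """Turn the job's environment map into `-e KEY=VALUE` docker flags."""
    flags: list[str] = []
    for key, value in environment.items():
        flags += ["-e", f"{key}={value}"]
    return flags

def run_job_container(image: str, command: list[str],
                      environment: dict[str, str]) -> None:
    """Start the job container with the provider-set variables injected."""
    subprocess.run(["docker", "run", "--rm", *build_env_flags(environment),
                    image, *command], check=True)
```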
Currently, if the diff is larger than 400KB, dstack CLI fails to submit the run.
Would be nice to be able to set a tag for the run right from the console like
dstack run train-model --tag latest
Currently, the user sees the logs from the provider in the run logs.
We should treat them as runner logs and not as run logs so the user doesn't see them.
If I use dstack init on a repository that uses SSH, dstack should be able to parse ~/.ssh/config automatically.
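A minimal sketch of what that parsing could look like, assuming only the common Host / key-value layout (no Match or Include support — a real implementation would want those too):

```python
from pathlib import Path

def parse_ssh_config(path: str = "~/.ssh/config") -> dict[str, dict[str, str]]:
    """Minimal ~/.ssh/config parser: maps each Host alias to its
    lowercased options (hostname, user, identityfile, ...) — enough
    to resolve the identity file for an SSH remote."""
    hosts: dict[str, dict[str, str]] = {}
    current: list[str] = []
    for raw in Path(path).expanduser().read_text().splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(" ")
        if key.lower() == "host":
            current = value.split()
            for alias in current:
                hosts.setdefault(alias, {})
        else:
            for alias in current:
                hosts[alias][key.lower()] = value.strip()
    return hosts
```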
Now, the /runners/ping response provides, inside users, a secrets field with the list of secrets to pass as environment variables to the jobs.
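Merging those secrets into the job environment could be as simple as the following sketch; the precedence rule (explicit job variables win over same-named secrets) is an assumption, not something the issue specifies:

```python
def job_environment(job_env: dict[str, str],
                    secrets: dict[str, str]) -> dict[str, str]:
    """Merge the secrets from /runners/ping into the job's environment;
    variables set explicitly on the job win over same-named secrets
    (assumed precedence)."""
    merged = dict(secrets)
    merged.update(job_env)
    return merged
```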
In the tutorial, the data is downloaded via the library, which is not customizable enough.
It would be nice to have an option to pass data to the execution environment. For example, it could be a tag in the workflows specifying the path from which the data should be taken onto the AWS instance.
Thank you in advance!
The command should work similarly to dstack run, but instead of creating new jobs, it should change the existing jobs to the Submitted status.
There are tons of dependencies apart from the ones passed by the user. These dependencies are installed each time the run is submitted. It would be nice to optimize this part.
Ideas:
Containers from Colab/Kaggle would be really nice, as they are more or less standard and behave as expected with popular libraries.
Here's one way to do it:
Add a cuda setting to runner.yaml. Add a config --cuda <...> argument to dstack-runner.
Substitute ${{ cuda }} within jobs' image_name with the configured CUDA version. Do the same for the Docker image that is used to run nvidia-smi.
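The ${{ cuda }} substitution step could be sketched like this (the function name and the example image name are hypothetical, for illustration only):

```python
import re

def interpolate_cuda(image_name: str, cuda_version: str) -> str:
    """Substitute a ${{ cuda }} placeholder in a job's image_name with
    the CUDA version configured on the runner; names without the
    placeholder pass through unchanged."""
    return re.sub(r"\$\{\{\s*cuda\s*\}\}", cuda_version, image_name)
```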