
Caliban


Caliban is a tool that helps researchers launch and track their numerical experiments in an isolated, reproducible computing environment. It was developed by machine learning researchers and engineers, and makes it easy to go from a simple prototype running on a workstation to thousands of experimental jobs running in the Cloud.

With Caliban, you can:

  • Develop your experimental code locally and test it inside an isolated (Docker) environment
  • Easily sweep over experimental parameters
  • Submit your experiments as Cloud jobs, where they will run in the same isolated environment
  • Control and keep track of jobs

Quickstart

Install Docker, make sure it's running, then install Caliban (you'll need Python >= 3.6):

pip install caliban

Train a simple deep learning model on your local machine:

git clone https://github.com/google/caliban.git && cd caliban/tutorials/basic
caliban run --nogpu mnist.py

Sweep over learning rates to find the best one (flags are specified in JSON format):

echo '{"learning_rate": [0.01, 0.001, 0.0001]}' | caliban run --experiment_config stdin --nogpu mnist.py

Next steps

Full documentation for Caliban lives at Read the Docs.

Dramatic Interlude

“Be not afeard; the isle is full of noises,
Sounds, and sweet airs, that give delight and hurt not.
Sometimes a thousand twangling instruments
Will hum about mine ears; and sometime voices,
That, if I then had waked after long sleep,
Will make me sleep again: and then, in dreaming,
The clouds methought would open, and show riches
Ready to drop upon me; that, when I waked,
I cried to dream again.”

-- Shakespeare, The Tempest

Installation and Prerequisites

Caliban's prerequisites are Docker and Python >= 3.6.

Make sure your Python is up to date:

$ python --version
Python 3.6.9 # should be >=3.6.0

If not, visit "Installing Python 3.6" before proceeding.

Next, install Caliban via pip:

pip install -U caliban

Check that your installation worked by navigating to an empty folder and running caliban --help. You should see the usage dialogue:

$ caliban --help
usage: caliban [-h] [--helpfull] [--version]
               {shell,notebook,build,run,cloud,cluster,status,stop,resubmit}
               ...

Docker

Caliban executes your code inside a "container" managed by Docker, so you'll need a working Docker installation; see Docker's installation instructions for your platform.

Make sure Docker is correctly installed, configured and running by executing the following command:

docker run hello-world

You should see output that looks like this:

...
Hello from Docker!
This message shows that your installation appears to be working correctly.
...

Python 3.6

Make sure your Python version is up to date:

$ python --version
Python 3.6.9 # should be >=3.6.0

If you need to upgrade:

  • On MacOS, install the latest Python version from python.org (direct link).
  • On Linux, run sudo apt-get update && sudo apt-get install python3.7.

Cloud Submission and GPUs

Caliban's Read the Docs documentation has instructions on Cloud job submission and GPU configuration.

Getting Started with Caliban

In this section we will use Caliban to train an image classification network (implemented in TensorFlow). We will:

  • Train a neural network on the local machine
  • Increase the model's accuracy by changing the learning rate with a command-line flag
  • Sweep across a range of learning rates with Caliban's experiment broadcasting feature
  • Train the model in the Cloud on Google's AI Platform
  • Develop code interactively using caliban shell in the exact same environment.

Preparing your Project

Create an empty directory and use curl to download a Python script that trains a basic neural network.

mkdir demo && cd demo
curl --output mnist.py https://raw.githubusercontent.com/google/caliban/main/tutorials/basic/mnist.py

Create a file called requirements.txt to declare tensorflow-cpu as a dependency:

echo "tensorflow-cpu" > requirements.txt

Caliban will automatically make any entry in requirements.txt available when you run your code. See "Declaring Requirements" for more information.

Training the Network

Run this command to train your first ML model:

caliban run --nogpu mnist.py

You should see a stream of output ending in this:

Training model with learning rate=0.1 for 3 epochs.
Epoch 1/3
1875/1875 - 3s - loss: 2.0989 - accuracy: 0.2506
Epoch 2/3
1875/1875 - 3s - loss: 1.9222 - accuracy: 0.2273
Epoch 3/3
1875/1875 - 3s - loss: 2.0777 - accuracy: 0.1938
Model performance:
313/313 - 0s - loss: 2.0973 - accuracy: 0.1858

Your model was able to recognize digits from the MNIST dataset with 18.58% accuracy. Can we do better?

Improving the Model

The default learning rate is 0.1. Run the code again with a smaller learning rate by passing a command-line flag, separated from your original command by --:

$ caliban run --nogpu mnist.py -- --learning_rate 0.01

<<elided>>

Training model with learning rate=0.01 for 3 epochs.
Epoch 1/3
1875/1875 - 4s - loss: 0.2676 - accuracy: 0.9221
Epoch 2/3
1875/1875 - 4s - loss: 0.1863 - accuracy: 0.9506
Epoch 3/3
1875/1875 - 4s - loss: 0.1567 - accuracy: 0.9585
Model performance:
313/313 - 0s - loss: 0.1410 - accuracy: 0.9642

96% accuracy! Much better! Can we do better still?
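Everything after the -- separator is passed straight through to your script as ordinary command-line arguments. As an illustration only, here is a minimal sketch of how a script like mnist.py could consume such a flag using Python's stdlib argparse (the tutorial script's actual flag-parsing may differ):

```python
import argparse

def parse_flags(argv=None):
    # Hypothetical sketch: parse the --learning_rate flag that caliban
    # forwards to the script after the `--` separator.
    parser = argparse.ArgumentParser(description="toy MNIST trainer flags")
    parser.add_argument("--learning_rate", type=float, default=0.1,
                        help="learning rate; the tutorial's default is 0.1")
    return parser.parse_args(argv)

args = parse_flags(["--learning_rate", "0.01"])
print(args.learning_rate)  # 0.01
```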

Experiment Broadcasting

Caliban's experiment broadcasting feature will allow us to run many jobs with different sets of arguments.

Create a file called experiment.json with a JSON dictionary of the format {"flag_name": ["list", "of", "values"]}:

echo '{"learning_rate": [0.01, 0.001, 0.0001]}' > experiment.json
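Each key names a command-line flag and each list holds the values to try; Caliban launches one job per combination. A rough sketch of that expansion, using a hypothetical helper (not Caliban's internal code):

```python
from itertools import product

def expand_config(config):
    # Expand {"flag": [v1, v2, ...]} into one argument list per
    # combination of values (a sketch of experiment broadcasting).
    keys = sorted(config)
    jobs = []
    for values in product(*(config[k] for k in keys)):
        args = []
        for key, value in zip(keys, values):
            args += ["--{}".format(key), str(value)]
        jobs.append(args)
    return jobs

for job_args in expand_config({"learning_rate": [0.01, 0.001, 0.0001]}):
    print(job_args)
# ['--learning_rate', '0.01']
# ['--learning_rate', '0.001']
# ['--learning_rate', '0.0001']
```

With multiple keys, the number of jobs is the product of the list lengths.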

Pass the config with --experiment_config and run again:

caliban run --experiment_config experiment.json --nogpu mnist.py

You should see accuracies of roughly 0.9493, 0.9723 and 0.9537. Looks like 0.001 is a nice choice.

Submitting to Cloud AI Platform

Now it's time to submit the job to Cloud AI Platform.

(NOTE: This section requires a Google Cloud account. You can create a free account with $300 of credit to get started. Follow Caliban's "Getting Started with Google Cloud" documentation, then come back here to proceed.)

Submit the job to AI Platform by changing the word run to cloud:

caliban cloud --nogpu mnist.py -- --learning_rate 0.01

You should see output like this:

I0615 19:57:43.354172 4563361216 core.py:161] Job 1 - jobId: caliban_totoro_1, image: gcr.io/research-3141/974a776e6037:latest
I0615 19:57:43.354712 4563361216 core.py:161] Job 1 - Accelerator: {'count': 0, 'type': 'ACCELERATOR_TYPE_UNSPECIFIED'}, machine: 'n1-highcpu-32', region: 'us-central1'
I0615 19:57:43.355082 4563361216 core.py:161] Job 1 - Experiment arguments: ['--learning_rate', '0.01']
I0615 19:57:43.355440 4563361216 core.py:161] Job 1 - labels: {'gpu_enabled': 'false', 'tpu_enabled': 'false', 'job_name': 'caliban_totoro', 'learning_rate': '0_01'}

I0615 19:57:43.356621 4563361216 core.py:324] Submitting request!
I0615 19:57:45.078382 4563361216 core.py:97] Request for job 'caliban_totoro_20200615_195743_1' succeeded!
I0615 19:57:45.078989 4563361216 core.py:98] Job URL: https://console.cloud.google.com/ai-platform/jobs/caliban_totoro_20200615_195743_1?projectId=totoro-project
I0615 19:57:45.079524 4563361216 core.py:100] Streaming log CLI command: $ gcloud ai-platform jobs stream-logs caliban_totoro_20200615_195743_1
Submitting caliban_totoro_1: 100%|####################################################################################################################################################################################| 1/1 [00:02<00:00,  2.65s/requests]
I0615 19:57:45.405600 4563361216 core.py:673]
I0615 19:57:45.405819 4563361216 core.py:676] Visit https://console.cloud.google.com/ai-platform/jobs/?projectId=research-3141 to see the status of all jobs.
I0615 19:57:45.405959 4563361216 core.py:677]

This output means that Caliban has built a Docker image containing your code, pushed that image to Google's Container Registry (the gcr.io image in the log), and submitted the job to AI Platform.

You can now visit the link in the output that looks like: https://console.cloud.google.com/ai-platform/jobs/caliban_totoro_20200615_195743_1?projectId=totoro-project to see all of your job's logs.
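Note the label 'learning_rate': '0_01' in the log above: Cloud labels have a restricted character set (roughly lowercase letters, digits, underscores and hyphens), so flag values like 0.01 are sanitized before submission. A rough sketch of that kind of sanitization (hypothetical helper, not Caliban's actual implementation):

```python
import re

def sanitize_label_value(value):
    # Hypothetical sketch: map an arbitrary flag value to a string that
    # satisfies Cloud label constraints (lowercase letters, digits, _ and -).
    return re.sub(r"[^a-z0-9_-]", "_", str(value).lower())

print(sanitize_label_value("0.01"))  # 0_01
```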

Why do I need Cloud?

With Google Cloud, you can use on-demand GPUs and TPUs and train models on large datasets at very high speeds. You can also customize the machine type that AI Platform uses to run your job; you might need more memory or CPU, for example.

See Caliban's "Customizing Machines and GPUs" for more information.

Interactive Development with caliban shell

caliban shell lets you develop code interactively inside the exact same environment that your code will have available, both locally during caliban run and in the Cloud with caliban cloud.

Run the following command to activate the shell:

caliban shell --nogpu

You should see Caliban's terminal:

I0611 12:33:17.551121 4500135360 docker.py:911] Running command: docker run --ipc host -w /usr/app -u 735994:89939 -v /Users/totoro/code/example:/usr/app -it --entrypoint /bin/bash -v /Users/totoro:/home/totoro ab8a7d7db868
   _________    __    ________  ___    _   __  __  __
  / ____/   |  / /   /  _/ __ )/   |  / | / /  \ \ \ \
 / /   / /| | / /    / // __  / /| | /  |/ /    \ \ \ \
/ /___/ ___ |/ /____/ // /_/ / ___ |/ /|  /     / / / /
\____/_/  |_/_____/___/_____/_/  |_/_/ |_/     /_/ /_/

You are running caliban shell as user with ID 735994 and group 89939,
which should map to the ID and group for your user on the Docker host. Great!

[totoro@6a9b28990757 /usr/app]$

You're now living in an isolated Docker container with your tensorflow-cpu dependency available (and any others you've declared).

Run the python command and check that tensorflow is installed:

$ python
Python 3.6.9 (default, Nov  7 2019, 10:44:02)
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
>>> tf.__version__
'2.2.0'

Your home directory and the folder where you ran the command are both mounted into this isolated environment, so any changes you make to either of those directories will be reflected immediately.

Any code you add to the current folder and edit on your computer will be available in this special Caliban shell. Run the example from before like this:

python mnist.py --learning_rate 0.01

If your code runs in caliban shell, you can be almost certain that your code will execute in a Cloud environment, with potentially many GPUs attached and much larger machines available.

What next?

Read the Overview for more information on Caliban's subcommands, then head over to Caliban's documentation site and check out the links on the sidebar.

If you find anything confusing, please feel free to create an issue on our Github Issues page, and we'll get you sorted out.

Command Overview

Caliban provides seven subcommands that you run inside some project directory on your machine:

  • caliban shell generates a Docker image containing any dependencies you've declared in a requirements.txt and/or setup.py in the directory and opens an interactive shell in that directory. The caliban shell environment is ~identical to the environment that will be available to your code when you submit it to AI Platform; the difference is that your current directory is live-mounted into the container, so you can develop interactively.

  • caliban notebook starts a Jupyter notebook or lab instance inside of a Docker image containing your dependencies; the guarantee about an environment identical to AI Platform applies here as well.

  • caliban run packages your directory's code into the Docker image and executes it locally using docker run. If you have a GPU, the instance will attach to it by default - no need to install the CUDA toolkit. The Docker environment takes care of all that. This environment is truly identical to the AI Platform environment. The Docker image that runs locally is the same image that will run in AI Platform.

  • caliban cloud allows you to submit jobs to AI Platform that will run inside the same Docker image you used with caliban run. You can submit hundreds of jobs at once. Any machine type, GPU count, and GPU type combination you specify will be validated client side, so you'll see an immediate error with suggestions, rather than having to debug by submitting jobs over and over.

  • caliban build builds the Docker image used in caliban cloud and caliban run without actually running the container or submitting any code.

  • caliban cluster creates GKE clusters and submits jobs to GKE clusters.

  • caliban status displays information about all jobs submitted by Caliban, and makes it easy to interact with large groups of experiments. Use caliban status when you need to cancel pending jobs, or re-build a container and resubmit a batch of experiments after fixing a bug.

Disclaimer

This is a research project, not an official Google product. Expect bugs and sharp edges. Please help by trying out Caliban, reporting bugs, and letting us know what you think!

Get Involved + Get Support

Pull requests and bug reports are always welcome! Check out our Contributor's Guide for information on how to get started contributing to Caliban.

The TL;DR is:

  • send us a pull request,
  • iterate on the feedback + discussion, and
  • get a +1 from a Committer

in order to get your PR accepted.

Issues should be reported on the GitHub issue tracker.

If you want to discuss an idea for a new feature or ask us a question, discussion occurs primarily in the body of Github Issues, though the project is growing large enough that we may start a Gitter channel soon.

The current list of active committers (who can +1 a pull request) can be found here: COMMITTERS.md

A list of contributors to the project can be found at the project's Contributors page.

Citing Caliban

If Caliban helps you in your research, please consider citing Caliban's associated academic paper:

@article{Ritchie2020,
  doi = {10.21105/joss.02403},
  url = {https://doi.org/10.21105/joss.02403},
  year = {2020},
  publisher = {The Open Journal},
  volume = {5},
  number = {53},
  pages = {2403},
  author = {Sam Ritchie and Ambrose Slone and Vinay Ramasesh},
  title = {Caliban: Docker-based job manager for reproducible workflows},
  journal = {Journal of Open Source Software}
}

License

Copyright 2020 Google LLC.

Licensed under the Apache License, Version 2.0.

caliban's People

Contributors: ajslone, eschnett, guygurari, jinksi, johnynek, ramasesh, sritchie


caliban's Issues

Abrupt exit when training a model

Hi all,

Thanks for putting this package up! I really love the idea behind it and can't wait to integrate it more tightly with my workflow!

I'm trying to integrate Caliban with one of my smaller projects I'm working on here, but I'm having some trouble getting things to run. I added the requirements.txt file as instructed, but when I run the training script, I don't see any visible error and the process exits abruptly.

I'm using a Mac, and my data is stored at /Users/dilip.thiagarajan/data. Here's exactly what I did:

  • In that repository, I first tried running:
caliban run --nogpu --docker_run_args "--volume /Users/dilip.thiagarajan/data:/data" train.py -- --model_name resnet18 --projection_dim 64 --fast_dev_run True --download --data_dir /data

When I run this from the terminal, I see the following output:

dilip.thiagarajan simclr_pytorch % caliban run --nogpu --docker_run_args "--volume /Users/dilip.thiagarajan/data:/data" train.py -- --model_name resnet18 --projection_dim 64 --fast_dev_run True --download --data_dir /data                    
I0624 22:07:53.246673 4578139584 docker.py:614] Running command: docker build --rm -f- /Users/dilip.thiagarajan/code/simclr_pytorch
Sending build context to Docker daemon  110.6kB

Step 1/11 : FROM gcr.io/blueshift-playground/blueshift:cpu
 ---> fafdb20241ad
Step 2/11 : RUN [ $(getent group 20) ] || groupadd --gid 20 20
 ---> Using cache
 ---> 6b724e6c1e38
Step 3/11 : RUN useradd --no-log-init --no-create-home -u 502 -g 20 --shell /bin/bash dilip.thiagarajan
 ---> Using cache
 ---> 251bdcb68ec9
Step 4/11 : RUN mkdir -m 777 /usr/app /.creds /home/dilip.thiagarajan
 ---> Using cache
 ---> d2952e2052e3
Step 5/11 : ENV HOME=/home/dilip.thiagarajan
 ---> Using cache
 ---> d8c700640045
Step 6/11 : WORKDIR /usr/app
 ---> Using cache
 ---> 8d6fd0c9f3f4
Step 7/11 : USER 502:20
 ---> Using cache
 ---> 293fcdb3733f
Step 8/11 : COPY --chown=502:20 requirements.txt /usr/app
 ---> Using cache
 ---> 9074b050a5de
Step 9/11 : RUN /bin/bash -c "pip install --no-cache-dir -r requirements.txt"
 ---> Using cache
 ---> 60f28d41deb9
Step 10/11 : COPY --chown=502:20 . /usr/app/.
 ---> 74b6d6b6d42f
Step 11/11 : ENTRYPOINT ["python", "train.py"]
 ---> Running in 54a219fe9826
Removing intermediate container 54a219fe9826
 ---> 081b2c362108
Successfully built 081b2c362108
I0624 22:07:54.054889 4578139584 util.py:710] Restoring pure python logging
I0624 22:07:54.057392 4578139584 docker.py:707]                                                                                                                                                
I0624 22:07:54.057760 4578139584 docker.py:708] Job 1 - Experiment args: ['--model_name', 'resnet18', '--projection_dim', '64', '--fast_dev_run', 'True', '--download', '--data_dir', '/data'] 
I0624 22:07:54.057989 4578139584 docker.py:787] Running command: docker run --ipc host --volume /Users/dilip.thiagarajan/data:/data 081b2c362108 --model_name resnet18 --projection_dim 64 --fast_dev_run True --download --data_dir /data
Executing:   0%|                                                                                                                                                 | 0/1 [00:00<?, ?experiment/s]Downloading: "https://download.pytorch.org/models/resnet18-5c106cde.pth" to /home/dilip.thiagarajan/.cache/torch/checkpoints/resnet18-5c106cde.pth
100%|██████████| 44.7M/44.7M [00:00<00:00, 52.8MB/s]
Running in fast_dev_run mode: will run a full train, val and test loop using a single batch
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
/opt/conda/envs/caliban/lib/python3.7/site-packages/pytorch_lightning/utilities/distributed.py:25: RuntimeWarning: You have defined a `val_dataloader()` and have defined a `validation_step()`, you may also want to define `validation_epoch_end()` for accumulating stats.
  warnings.warn(*args, **kwargs)

  | Name            | Type            | Params
----------------------------------------------------
0 | model           | Sequential      | 11 M  
1 | projection_head | Linear          | 32 K  
2 | loss            | NTXEntCriterion | 0     
Files already downloaded and verified                                                                                                                                                          
Files already downloaded and verified                                                                                                                                                          
Training: 0it [00:00, ?it/s]                                                                                                                                                                   
Training:   0%|          | 0/2 [00:00<?, ?it/s]                                                                                                                                                
E0624 22:08:09.984529 4578139584 docker.py:747] Job 1 failed with return code 137.                                                                                                             
E0624 22:08:09.984878 4578139584 docker.py:750] Failing args for job 1: ['--model_name', 'resnet18', '--projection_dim', '64', '--fast_dev_run', 'True', '--download', '--data_dir', '/data']  
Executing: 100%|#########################################################################################################################################| 1/1 [00:15<00:00, 15.93s/experiment]

However, when I redirect the output to a log file by running

caliban run --nogpu --docker_run_args "--volume /Users/dilip.thiagarajan/data:/data" train.py -- --model_name resnet18 --projection_dim 64 --fast_dev_run True --download --data_dir /data &> caliban_run.log &

I see the following in my trace:

Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/Users/dilip.thiagarajan/.pyenv/versions/3.7.3/lib/python3.7/logging/__init__.py", line 2039, in shutdown
    h.close()
  File "/Users/dilip.thiagarajan/.pyenv/versions/3.7.3/lib/python3.7/site-packages/absl/logging/__init__.py", line 864, in close
    self.stream.close()
AttributeError: 'TqdmFile' object has no attribute 'close'

Is this a problem with some interaction with logging and tqdm? Or is it something I'm doing that's incorrect when I'm mounting my data directory?

The following works properly for me locally:
python3 train.py --model_name resnet18 --projection_dim 64 --fast_dev_run True --data_dir ~/data --download

Thanks for your help!

HTTP Error 403: Forbidden

Hi,

I've been working with Caliban for a while, but just now, running exactly the same code, I'm getting a 403 error when downloading the MNIST dataset:

Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz to datasets/MNIST/raw/train-images-idx3-ubyte.gz
0it [00:00, ?it/s]
Traceback (most recent call last):
  File "mlp.py", line 257, in <module>
    main()
  File "mlp.py", line 187, in main
    train_dataset = load_data('train', args.dataset, args.datadir, nchannels)
  File "mlp.py", line 91, in load_data
    dataset = get_dataset(root=datadir, train=True, download=True, transform=tr_transform)
  File "/opt/conda/envs/caliban/lib/python3.7/site-packages/torchvision/datasets/mnist.py", line 79, in __init__
    self.download()
  File "/opt/conda/envs/caliban/lib/python3.7/site-packages/torchvision/datasets/mnist.py", line 146, in download
    download_and_extract_archive(url, download_root=self.raw_folder, filename=filename, md5=md5)
  File "/opt/conda/envs/caliban/lib/python3.7/site-packages/torchvision/datasets/utils.py", line 256, in download_and_extract_archive
    download_url(url, download_root, filename, md5)
  File "/opt/conda/envs/caliban/lib/python3.7/site-packages/torchvision/datasets/utils.py", line 84, in download_url
    raise e
  File "/opt/conda/envs/caliban/lib/python3.7/site-packages/torchvision/datasets/utils.py", line 72, in download_url
    reporthook=gen_bar_updater()
  File "/opt/conda/envs/caliban/lib/python3.7/urllib/request.py", line 247, in urlretrieve
    with contextlib.closing(urlopen(url, data)) as fp:
  File "/opt/conda/envs/caliban/lib/python3.7/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/opt/conda/envs/caliban/lib/python3.7/urllib/request.py", line 531, in open
    response = meth(req, response)
  File "/opt/conda/envs/caliban/lib/python3.7/urllib/request.py", line 641, in http_response
    'http', request, response, code, msg, hdrs)
  File "/opt/conda/envs/caliban/lib/python3.7/urllib/request.py", line 569, in error
    return self._call_chain(*args)
  File "/opt/conda/envs/caliban/lib/python3.7/urllib/request.py", line 503, in _call_chain
    result = func(*args)
  File "/opt/conda/envs/caliban/lib/python3.7/urllib/request.py", line 649, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden

Cannot create cluster

I want to create a GKE cluster following your instructions. I have set up the cloud tools, authentication, etc. I receive this error message:

$ caliban cluster create --cluster_name einsteintoolkit-cluster --zone us-east1-b
I0801 23:07:37.349657 4416302528 cli.py:185] creating cluster einsteintoolkit-cluster in project fifth-curve-272318 in us-east1-b...
I0801 23:07:37.349900 4416302528 cli.py:186] please be patient, this may take several minutes
I0801 23:07:37.349989 4416302528 cli.py:188] visit https://console.cloud.google.com/kubernetes/clusters/details/us-east1-b/einsteintoolkit-cluster?project=fifth-curve-272318 to monitor cluster creation progress
E0801 23:07:37.582320 4416302528 util.py:68] exception in call <function Cluster.create at 0x7ffc08ba4a70>:
<HttpError 400 when requesting https://container.googleapis.com/v1beta1/projects/fifth-curve-272318/zones/us-east1-b/clusters?alt=json returned "Resource_limit.maximum must be greater than 0.">
Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.7/site-packages/caliban-0.3.0+8.gaf9dd99-py3.7.egg/caliban/platform/gke/util.py", line 65, in wrapper
    response = fn(*args, **kwargs)
  File "/opt/anaconda3/lib/python3.7/site-packages/caliban-0.3.0+8.gaf9dd99-py3.7.egg/caliban/platform/gke/cluster.py", line 1178, in create
    rsp = request.execute()
  File "/opt/anaconda3/lib/python3.7/site-packages/google_api_python_client-1.10.0-py3.7.egg/googleapiclient/_helpers.py", line 134, in positional_wrapper
    return wrapped(*args, **kwargs)
  File "/opt/anaconda3/lib/python3.7/site-packages/google_api_python_client-1.10.0-py3.7.egg/googleapiclient/http.py", line 907, in execute
    raise HttpError(resp, content, uri=self.uri)
googleapiclient.errors.HttpError: <HttpError 400 when requesting https://container.googleapis.com/v1beta1/projects/fifth-curve-272318/zones/us-east1-b/clusters?alt=json returned "Resource_limit.maximum must be greater than 0.">

When I use --dry_run, I see these details:

$ caliban cluster create --cluster_name einsteintoolkit-cluster --zone us-east1-b --dry_run
I0801 23:08:27.893903 4585823680 cli.py:175] request:
{'cluster': {'autoscaling': {'autoprovisioningNodePoolDefaults': {'oauthScopes': ['https://www.googleapis.com/auth/compute',
                                                                                  'https://www.googleapis.com/auth/cloud-platform']},
                             'enableNodeAutoprovisioning': 'true',
                             'resourceLimits': [{'maximum': '24',
                                                 'resourceType': 'cpu'},
                                                {'maximum': '1536',
                                                 'resourceType': 'memory'},
                                                {'maximum': '1',
                                                 'resourceType': 'nvidia-tesla-k80'},
                                                {'maximum': '1',
                                                 'resourceType': 'nvidia-tesla-p100'},
                                                {'maximum': '1',
                                                 'resourceType': 'nvidia-tesla-v100'},
                                                {'maximum': '1',
                                                 'resourceType': 'nvidia-tesla-p4'},
                                                {'maximum': '1',
                                                 'resourceType': 'nvidia-tesla-t4'},
                                                {'maximum': '0',
                                                 'resourceType': 'nvidia-tesla-a100'}]},
             'enable_tpu': 'true',
             'ipAllocationPolicy': {'useIpAliases': 'true'},
             'locations': ['us-east1-b', 'us-east1-c', 'us-east1-d'],
             'name': 'einsteintoolkit-cluster',
             'nodePools': [{'config': {'oauthScopes': ['https://www.googleapis.com/auth/devstorage.read_only',
                                                       'https://www.googleapis.com/auth/logging.write',
                                                       'https://www.googleapis.com/auth/monitoring',
                                                       'https://www.googleapis.com/auth/service.management.readonly',
                                                       'https://www.googleapis.com/auth/servicecontrol',
                                                       'https://www.googleapis.com/auth/trace.append']},
                            'initialNodeCount': '3',
                            'name': 'default-pool'}],
             'releaseChannel': {'channel': 'REGULAR'},
             'zone': 'us-east1-b'},
 'parent': 'projects/fifth-curve-272318/locations/us-east1-b'}

There is indeed a resource request with a maximum of 0.

I am using the current master branch.

Upgrade to modern dependencies [project]

  • Migrate container build, push steps to GitHub Actions? or lock in Cloud Build on our new project
  • get GPU base working
  • test new container on a cloud deploy with modern cloud services
  • container registry, still working?

Docker image is rebuilt for every `cluster job submit`

I notice that Docker rebuilds the image for every cluster job submit, and pushes the respective changes to GCP. There is exactly one large layer that is pushed. I assume this is the working directory that is copied into the image. This is annoying since my working directory is quite large (it has a large executable), and I would like to speed this up.

I am not making any changes to the working directory. I don't know what changes trigger this. I assume that these are either spurious changes (Caliban/Docker doesn't detect that nothing actually changes), or maybe there is a time stamp that is needlessly updated.

(It would be convenient if caliban build had an option --dry_run that would output the Dockerfile that it generates.)

[JOSS review] community guidelines

For this check-list item:

Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

No. 1 is already in place, but I don't see clear guidelines for no. 2 and 3.

Distributed training

Is it possible to run Caliban with distributed training? Something like running torchrun ... locally.

Looking for a strangely-named image

Hello! Thanks again for this great project.

Running on a mac laptop, having installed caliban with pip, I tried the following:

$ caliban notebook --nogpu
I0204 10:58:44.096491 4624354752 build.py:731] Running command: docker build --rm -f- /Users/arokem/tmp/caliban_test
/Users/arokem/miniconda3/envs/caliban/lib/python3.8/subprocess.py:838: RuntimeWarning: line buffering (buffering=1) isn't supported in binary mode, the default buffer size will be used
  self.stdin = io.open(p2cwrite, 'wb', bufsize)
/Users/arokem/miniconda3/envs/caliban/lib/python3.8/subprocess.py:844: RuntimeWarning: line buffering (buffering=1) isn't supported in binary mode, the default buffer size will be used
  self.stdout = io.open(c2pread, 'rb', bufsize)
#1 [internal] load build definition from Dockerfile
#1 sha256:f28aea6def1a6cd99c3dba9a8588148c15a44dbfe43b481d262a0028d8dda975
#1 transferring dockerfile: 1.08kB 0.0s done
#1 DONE 0.0s

#2 [internal] load .dockerignore
#2 sha256:6dc01f396991620f90d338b881b4f6b1f6f0c9d6c01dfad942b60c6ef69159d7
#2 transferring context: 2B done
#2 DONE 0.0s

#3 [internal] load metadata for gcr.io/blueshift-playground/blueshift:cpu-ubuntu1804-py37
#3 sha256:25c05f5af803364da95cc3e34f6fd1b1fed810090a16c60858c3d711f73887c1
#3 DONE 0.3s

#4 [ 1/12] FROM gcr.io/blueshift-playground/blueshift:cpu-ubuntu1804-py37@sha256:ba03f280085e923a6be72af7d697cf209a17506cbe1963a927fc8633031c7fb0
#4 sha256:afa8668d82673191feeaa31b8753b5b616f1f9c9dbcb24c3e59516d900a74ef9
#4 DONE 0.0s

#10 [internal] load build context
#10 sha256:a35234ecc058bfe64aab295bb66df4d3bde95af86b97423e1400e05f86b78f8a
#10 transferring context: 248B done
#10 DONE 0.0s

#8 [ 5/12] WORKDIR /usr/app
#8 sha256:2018ebc767a5256774e0043b6c3a5b77d0fd9228fe9d1c25218eed5654b31f47
#8 CACHED

#5 [ 2/12] RUN [ $(getent group 20) ] || groupadd --gid 20 20
#5 sha256:83a3c5bf3f1567625775b104b115a569214db23e58cea0b09f683c85dfd0c12e
#5 CACHED

#9 [ 6/12] RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install --yes --no-install-recommends zsh && apt-get clean && rm -rf /var/lib/apt/lists/*
#9 sha256:152bc68698ec58a809ee62324e32d4a31182d396774266aadc4a704749cd0aaa
#9 CACHED

#7 [ 4/12] RUN mkdir -m 777 /usr/app /.creds /.resources /home/arokem
#7 sha256:ace23be941af73ca9b62052532466d06c8ff2656ebf85fb5107f3680461b7771
#7 CACHED

#13 [ 9/12] RUN pip install jupyter
#13 sha256:80d93a11752a3233671d7bef53df010723558d69cf6f3e03fa2369dce1c11247
#13 CACHED

#6 [ 3/12] RUN useradd --no-log-init --no-create-home -u 501 -g 20 --shell /bin/bash arokem
#6 sha256:8c0e7c3001be7c1c7ad87536bcfc588350fc5ab55a04b4f62370cf55249a3548
#6 CACHED

#14 [10/12] COPY --chown=501:20 cloud_sql_proxy.py /.resources
#14 sha256:10d002a7044af1e1cea195fed2a0ac90c7468733f5433146334f03f3097899d0
#14 CACHED

#11 [ 7/12] COPY --chown=501:20 requirements.txt /usr/app
#11 sha256:3d1c7cfc1fa38374f278e323ba6f19471d578bc0faaf13eb5f09cb6a28c809e6
#11 CACHED

#12 [ 8/12] RUN /bin/bash -c "pip install --no-cache-dir -r requirements.txt"
#12 sha256:ea84f2faf52c42eca47379f73b7718c2aaa561a1c60b21cedf5c97d0bdc5e6cb
#12 CACHED

#15 [11/12] COPY --chown=501:20 caliban_launcher.py /.resources
#15 sha256:13f784a9b4ae0df920d8e9b1a22d410508e4bae77af6693cd79d11e3d041a330
#15 CACHED

#16 [12/12] COPY --chown=501:20 ./caliban_launcher_cfg.json /.resources
#16 sha256:c9c4c3a49dee7f24c33b75dd28282056cfb21bb64885626da62e4bb90c55cf3b
#16 CACHED

#17 exporting to image
#17 sha256:e8c613e07b0b7ff33893b694f7759a10d42e180f2b4dc349fb57dc6b71dcab00
#17 exporting layers done
#17 writing image sha256:a06b3ff98fd03ae55ff165572f5a29559ebc8d16b0e78cf46986801d573cbdc8 done
#17 DONE 0.0s
I0204 10:58:44.979649 4624354752 run.py:335] Running command: docker run --ipc host -w /usr/app -u 501:20 -v /Users/arokem/tmp/caliban_test:/usr/app -it --entrypoint python -v /Users/arokem:/home/arokem -p 8889:8889 0.0s -m jupyter notebook --ip=0.0.0.0 --port=8889 --no-browser
Unable to find image '0.0s:latest' locally
docker: Error response from daemon: pull access denied for 0.0s, repository does not exist or may require 'docker login': denied: requested access to the resource is denied.
See 'docker run --help'.

Looks like maybe the docker command is malformed? Why is that 0.0s in there?
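A plausible diagnosis (an assumption, not confirmed in the report): BuildKit interleaves timing suffixes like `0.0s` into its output, and a loose parse of `docker build` output can pick one up as the image id. A sketch of a more robust extraction, matching BuildKit's explicit `writing image sha256:...` line (function name is hypothetical, not caliban's actual internals):

```python
import re

def extract_image_id(build_output: str) -> str:
    """Pull the image id out of BuildKit's `docker build` output.

    Anchoring on the `writing image sha256:` line avoids grabbing stray
    tokens such as the `0.0s` timing suffix seen in the report above.
    """
    match = re.search(r"writing image sha256:([0-9a-f]+)", build_output)
    if match is None:
        raise ValueError("no image id found in build output")
    return match.group(1)

# The relevant line from the log above:
log = "#17 writing image sha256:a06b3ff98fd03ae55ff165572f5a29559ebc8d16b0e78cf46986801d573cbdc8 done"
image_id = extract_image_id(log)
```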

A way to provide my own docker image?

Is there a way to provide a Dockerfile or a reference to a public Docker image instead of relying on requirements.txt, etc.?

Context: I'm working on a repo2docker Action that integrates with GitHub and want to explore automatically launching notebooks from a repo.

Google Cloud seems like a promising place to start! P.S. I found this tool through @rand

Feature request: support REES

REES is the specification supported by repo2docker. It would be great to support that. In particular, as I understand it (but correct me if I am wrong), caliban currently supports dependencies specified in a requirements.txt file. REES expands this to support dependencies in a number of other formats, including conda environment.yml files and even Dockerfiles. Implementing this would increase the range of things users could do with the software.
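A minimal sketch of how REES-style detection could look, assuming the file names from the repo2docker spec and a hypothetical dispatch function (Dockerfile taking precedence over conda, which takes precedence over plain pip requirements):

```python
from pathlib import Path
import tempfile

def detect_dependency_spec(project_dir) -> str:
    """Pick a dependency format the way repo2docker's REES does (sketch).

    Precedence order here is an assumption for illustration:
    Dockerfile > environment.yml > requirements.txt.
    """
    root = Path(project_dir)
    if (root / "Dockerfile").exists():
        return "dockerfile"
    if (root / "environment.yml").exists():
        return "conda"
    if (root / "requirements.txt").exists():
        return "pip"
    return "none"

# Tiny demo in a scratch directory.
demo = Path(tempfile.mkdtemp())
(demo / "requirements.txt").touch()
spec = detect_dependency_spec(demo)
```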

"Failed to read the container uri ... Please make sure that CloudML Engine service account has access to it"

Hello,

I am trying out the caliban cloud command in the demo (run locally works fine). It is running into permission issues when reading the container uri (listed below) after uploading the image:

command: caliban cloud --project_id xxx mnist.py -- --learning_rate 0.01

error:

core.py:103] Request for job 'caliban_ddohan_20200722_165145_1' failed! Details:
core.py:104] Field: master_config.image_uri Error: Failed to read the container uri [gcr.io/xxx/185c294caf60:latest]. Please make sure that CloudML Engine service account has access to it

caliban --version: caliban 0.2.6+8.gf95b955

Regards,
David

`caliban cloud`: providing project ID through CLI fails

I am seeing:

$ caliban cloud --nogpu mnist.py -- --project_id landscape-238422


No project_id found. 'caliban cloud' requires that you either set a 
$PROJECT_ID environment variable with the ID of your Cloud project, or pass one 
explicitly via --project_id. Try again, please!

Setting the environment variable does seem to work.

ModuleNotFoundError: No module named 'google'

I followed the instructions to install Caliban but running
caliban run --experiment_config config.json test.py
gives the following error:

File "/.resources/caliban_launcher.py", line 23, in
import google.auth
ModuleNotFoundError: No module named 'google'

I also checked and ran pip install google.

Update: after including google-cloud-storage in requirements.txt, the problem is solved.

Make caliban fall back to cpu-only gracefully for local or shell commands

Currently, if a user runs caliban shell on a Mac they get this message:

'caliban shell' doesn't support GPU usage on Macs! Please pass --nogpu to use this command.

If they do so on a linux machine without GPU they get:

...
I1029 12:21:05.082031 139972099938112 run.py:335] Running command: docker run --runtime nvidia --ipc host -w /usr/app -u 102880:89939 -v /google/src/cloud/danielfurrer/xcloud/google3/learning/brain/frameworks/xcloud/examples/huggingface:/usr/app -it --entrypoint /bin/bash -v /usr/local/google/home/danielfurrer:/home/danielfurrer 5cc9af84e5cd
docker: Error response from daemon: Unknown runtime specified nvidia.

Would it be reasonable to just detect what runtimes are available and fall back to the behavior of --nogpu (perhaps with a warning)?
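The detection the request describes could look roughly like this. `docker info --format '{{json .Runtimes}}'` is standard Docker CLI; the fallback function and warning are a hypothetical sketch, not caliban's actual code:

```python
import json
import subprocess

def gpu_runtime_available(runtimes: dict) -> bool:
    """True if Docker has the `nvidia` runtime registered."""
    return "nvidia" in runtimes

def docker_runtimes() -> dict:
    """Ask the Docker daemon for its registered runtimes (daemon must be up)."""
    out = subprocess.check_output(
        ["docker", "info", "--format", "{{json .Runtimes}}"])
    return json.loads(out)

def runtime_args(runtimes: dict) -> list:
    """Proposed fallback: drop `--runtime nvidia` (with a warning) when absent."""
    if gpu_runtime_available(runtimes):
        return ["--runtime", "nvidia"]
    print("warning: nvidia runtime not found; falling back to CPU-only")
    return []
```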

Missing newlines in generated Dockerfile when using GCP credentials

I started setting up GCP credentials etc., and am now seeing this error with caliban shell:

$ caliban shell --nogpu --docker_run_args '--volume /Users/eschnett/caliban-simulations:/caliban-simulations'
I0802 13:31:15.145786 4626697664 build.py:645] Running command: docker build --rm -f- /Users/eschnett/src/CarpetX

[...]

Step 8/8 : COPY --chown=501:20 .caliban_adc_creds.json /home/eschnett/.config/gcloud/application_default_credentials.jsonCOPY --chown=501:20 cloud_sql_proxy.py /.resources
COPY failed: stat /var/lib/docker/tmp/docker-builder529025309/home/eschnett/.config/gcloud/application_default_credentials.jsonCOPY: no such file or directory
E0802 13:31:33.397300 4626697664 main.py:165] Docker failed with error code 1.
E0802 13:31:33.397594 4626697664 main.py:166] Original command: docker build --rm -f- /Users/eschnett/src/CarpetX

The Dockerfile is missing a newline, so that two successive COPY statements run together.
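A sketch of the suspected fix, under the assumption that the Dockerfile is assembled from per-feature stanza strings: guarantee a trailing newline on each fragment when joining. The stanza text and function name below are illustrative:

```python
def join_stanzas(stanzas):
    """Join Dockerfile fragments, ensuring each ends with a newline.

    If the credentials stanza lacks a trailing newline, the next COPY
    fuses onto it, producing exactly the `...jsonCOPY` error shown above.
    """
    return "".join(s if s.endswith("\n") else s + "\n" for s in stanzas)

# Illustrative fragments mirroring the error above (paths shortened).
creds = ("COPY --chown=501:20 .caliban_adc_creds.json "
         "/home/user/.config/gcloud/application_default_credentials.json")  # no newline
proxy = "COPY --chown=501:20 cloud_sql_proxy.py /.resources\n"
dockerfile = join_stanzas([creds, proxy])
```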

Google-auth is not installed automatically

In order to run jobs on GCP, google-auth needs to be installed within the caliban venv. It seems like this is not the case, and unless google-auth is specified in the user's requirements.txt (or something that requires google-auth is specified in requirements.txt), jobs are unable to launch.

Issue with caliban package with installing using pip

There is an error while installing the caliban package using pip. The error says python setup.py egg_info did not run successfully. I am attaching the command-line output here.

Collecting caliban
  Using cached caliban-0.4.1-py3-none-any.whl (157 kB)
Collecting absl-py
  Using cached absl_py-1.2.0-py3-none-any.whl (123 kB)
Collecting google-auth>=1.19.0
  Using cached google_auth-2.11.0-py2.py3-none-any.whl (167 kB)
Collecting google-cloud-container>=0.3.0
  Downloading google_cloud_container-2.11.2-py2.py3-none-any.whl (202 kB)
     ---------------------------------------- 202.8/202.8 kB 2.4 MB/s eta 0:00:00
Collecting lark-parser<0.8.0,>=0.7.1
  Downloading lark-parser-0.7.8.tar.gz (276 kB)
     ---------------------------------------- 276.2/276.2 kB 2.8 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Collecting blessings
  Downloading blessings-1.7-py3-none-any.whl (18 kB)
Collecting yaspin>=0.16.0
  Using cached yaspin-2.2.0-py3-none-any.whl (18 kB)
Collecting kubernetes>=10.0.1
  Using cached kubernetes-24.2.0-py2.py3-none-any.whl (1.5 MB)
Collecting commentjson
  Downloading commentjson-0.9.0.tar.gz (8.7 kB)
  Preparing metadata (setup.py) ... done
Collecting urllib3>=1.25.7
  Downloading urllib3-1.26.12-py2.py3-none-any.whl (140 kB)
     ---------------------------------------- 140.4/140.4 kB 8.1 MB/s eta 0:00:00
Collecting psycopg2-binary==2.8.5
  Using cached psycopg2-binary-2.8.5.tar.gz (381 kB)
  Preparing metadata (setup.py) ... error
  error: subprocess-exited-with-error

  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [25 lines of output]
      C:\Users\Hamza Aziz\AppData\Local\Programs\Python\Python310\lib\site-packages\setuptools\config\setupcfg.py:463: SetuptoolsDeprecationWarning: The license_file parameter is deprecated, use license_files instead.
        warnings.warn(msg, warning_class)
      running egg_info
      creating C:\Users\Hamza Aziz\AppData\Local\Temp\pip-pip-egg-info-1p1u8oyy\psycopg2_binary.egg-info
      writing C:\Users\Hamza Aziz\AppData\Local\Temp\pip-pip-egg-info-1p1u8oyy\psycopg2_binary.egg-info\PKG-INFO
      writing dependency_links to C:\Users\Hamza Aziz\AppData\Local\Temp\pip-pip-egg-info-1p1u8oyy\psycopg2_binary.egg-info\dependency_links.txt
      writing top-level names to C:\Users\Hamza Aziz\AppData\Local\Temp\pip-pip-egg-info-1p1u8oyy\psycopg2_binary.egg-info\top_level.txt
      writing manifest file 'C:\Users\Hamza Aziz\AppData\Local\Temp\pip-pip-egg-info-1p1u8oyy\psycopg2_binary.egg-info\SOURCES.txt'

      Error: pg_config executable not found.

      pg_config is required to build psycopg2 from source.  Please add the directory
      containing pg_config to the $PATH or specify the full executable path with the
      option:

          python setup.py build_ext --pg-config /path/to/pg_config build ...

      or with the pg_config option in 'setup.cfg'.

      If you prefer to avoid building psycopg2 from source, please install the PyPI
      'psycopg2-binary' package instead.

      For further information please check the 'doc/src/install.rst' file (also at
      <https://www.psycopg.org/docs/install.html>).

      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.

Documentation: Caliban Default Creds

Couple issues

  1. gcloud service account credentials are now required (as .caliban_default_creds), but this is not documented

  2. Even when I supply working credentials, gcloud auth inside Docker is not working:

Step 9/18 : RUN gcloud auth activate-service-account --key-file=/.creds/credentials.json &&   git config --global credential.'https://source.developers.google.com'.helper gcloud.sh
 ---> Running in ef0260e778f6
ERROR: (gcloud.auth.activate-service-account) The .json key file is not in a valid format.
The command '/bin/sh -c gcloud auth activate-service-account --key-file=/.creds/credentials.json &&   git config --global credential.'https://source.developers.google.com'.helper gcloud.sh' returned a non-zero code: 1
E0820 10:49:44.640692 4603461056 main.py:165] Docker failed with error code 1.

Confirmation it works external to Docker:

gcloud auth activate-service-account --key-file ...
Activated service account credentials for: [...]

Insufficient quota in GCP free trial account

I want to create a cluster. After working around #65, I encounter a quota error because my 12-month free trial account apparently doesn't have enough IP addresses (I have 8, but need 12):

$ caliban cluster create --cluster_name einsteintoolkit-cluster --zone us-central1-a
I0801 23:46:03.689018 4536327616 cli.py:185] creating cluster einsteintoolkit-cluster in project fifth-curve-272318 in us-central1-a...
I0801 23:46:03.689273 4536327616 cli.py:186] please be patient, this may take several minutes
I0801 23:46:03.689368 4536327616 cli.py:188] visit https://console.cloud.google.com/kubernetes/clusters/details/us-central1-a/einsteintoolkit-cluster?project=fifth-curve-272318 to monitor cluster creation progress
W0801 23:46:06.863384 4536327616 http.py:123] Invalid JSON content from response: b'{\n  "error": {\n    "code": 403,\n    "message": "Insufficient regional quota to satisfy request: resource \\"IN_USE_ADDRESSES\\": request requires \'12.0\' and is short \'4.0\'. project has a quota of \'8.0\' with \'8.0\' available. View and manage quotas at https://console.cloud.google.com/iam-admin/quotas?usage=USED&project=fifth-curve-272318.",\n    "status": "PERMISSION_DENIED"\n  }\n}\n'
E0801 23:46:06.868414 4536327616 util.py:68] exception in call <function Cluster.create at 0x7feba0b13dd0>:
<HttpError 403 when requesting https://container.googleapis.com/v1beta1/projects/fifth-curve-272318/zones/us-central1-a/clusters?alt=json returned "Insufficient regional quota to satisfy request: resource "IN_USE_ADDRESSES": request requires '12.0' and is short '4.0'. project has a quota of '8.0' with '8.0' available. View and manage quotas at https://console.cloud.google.com/iam-admin/quotas?usage=USED&project=fifth-curve-272318.">
Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.7/site-packages/caliban-0.3.0+8.gaf9dd99.dirty-py3.7.egg/caliban/platform/gke/util.py", line 65, in wrapper
    response = fn(*args, **kwargs)
  File "/opt/anaconda3/lib/python3.7/site-packages/caliban-0.3.0+8.gaf9dd99.dirty-py3.7.egg/caliban/platform/gke/cluster.py", line 1178, in create
    rsp = request.execute()
  File "/opt/anaconda3/lib/python3.7/site-packages/google_api_python_client-1.10.0-py3.7.egg/googleapiclient/_helpers.py", line 134, in positional_wrapper
    return wrapped(*args, **kwargs)
  File "/opt/anaconda3/lib/python3.7/site-packages/google_api_python_client-1.10.0-py3.7.egg/googleapiclient/http.py", line 907, in execute
    raise HttpError(resp, content, uri=self.uri)
googleapiclient.errors.HttpError: <HttpError 403 when requesting https://container.googleapis.com/v1beta1/projects/fifth-curve-272318/zones/us-central1-a/clusters?alt=json returned "Insufficient regional quota to satisfy request: resource "IN_USE_ADDRESSES": request requires '12.0' and is short '4.0'. project has a quota of '8.0' with '8.0' available. View and manage quotas at https://console.cloud.google.com/iam-admin/quotas?usage=USED&project=fifth-curve-272318.">

Free trial accounts cannot update their quota. Is there a way to request fewer IP addresses?

Feature request: Custom base images

I want to install additional software dependencies into an image, but this software is not available via apt, nor is it easy to install. To solve this, I want to run additional Docker stanzas in the image used by Caliban.

One way to do so would be to use a different base image (i.e. DEV_CONTAINER_ROOT); in that image, I would start from gcr.io/blueshift-playground/blueshift and then add my own software. I put a sample such Dockerfile at https://gist.github.com/eschnett/5390892b6d8348ea3be5ca35d95f4990.

Alternatively, one could specify these stanzas in .calibanconfig.json.

I tried the --image_id option, but this doesn't fully work: it works for caliban shell, but for caliban run it skips packing up the current directory.

Create base image based on Ubuntu 20.04 LTS

I need to use several software packages that are either not available in Ubuntu 18.04 or are outdated there. It would be quite convenient for me if the base image used Ubuntu 20.04 LTS instead of 18.04 LTS.

For reference, I need these packages which are not available in 18.04:

  • cmake >= 3.14
  • gcc >= 8
  • libopenmpi-dev >= 3
  • libpsm2

I can, of course, install these from source, but this is time-consuming every time the image is rebuilt.

Timeout when submitting request to deploy job

I'm running caliban to deploy a model to Google Cloud as follows:

caliban cloud --project 'taglia' --region 'us-central1' --machine_type 'n1-standard-32' --gpu_spec '4xV100' train_panoptic.py -- <TRAIN SCRIPT ARGS>

I am getting stuck with the following error:

I0708 16:33:54.308651 139689314522944 core.py:324] Submitting request!                                                   
W0708 16:33:54.516765 139689314522944 http.py:171] Sleeping 1.11 seconds before retry 1 of 10 for request: POST https://ml.googleapis.com/v1/projects/taglia/jobs?alt=json, after 429
W0708 16:33:55.784710 139689314522944 http.py:171] Sleeping 2.46 seconds before retry 2 of 10 for request: POST https://ml.googleapis.com/v1/projects/taglia/jobs?alt=json, after 429
W0708 16:33:58.420140 139689314522944 http.py:171] Sleeping 5.17 seconds before retry 3 of 10 for request: POST https://ml.googleapis.com/v1/projects/taglia/jobs?alt=json, after 429
W0708 16:34:03.754786 139689314522944 http.py:171] Sleeping 1.37 seconds before retry 4 of 10 for request: POST https://ml.googleapis.com/v1/projects/taglia/jobs?alt=json, after 429
W0708 16:34:05.324873 139689314522944 http.py:171] Sleeping 5.63 seconds before retry 5 of 10 for request: POST https://ml.googleapis.com/v1/projects/taglia/jobs?alt=json, after 429
W0708 16:34:11.127667 139689314522944 http.py:171] Sleeping 13.70 seconds before retry 6 of 10 for request: POST https://ml.googleapis.com/v1/projects/taglia/jobs?alt=json, after 429
W0708 16:34:25.041211 139689314522944 http.py:171] Sleeping 124.27 seconds before retry 7 of 10 for request: POST https://ml.googleapis.com/v1/projects/taglia/jobs?alt=json, after 429
W0708 16:36:29.560709 139689314522944 http.py:171] Sleeping 79.67 seconds before retry 8 of 10 for request: POST https://ml.googleapis.com/v1/projects/taglia/jobs?alt=json, after 429
W0708 16:37:49.466338 139689314522944 http.py:171] Sleeping 351.15 seconds before retry 9 of 10 for request: POST https://ml.googleapis.com/v1/projects/taglia/jobs?alt=json, after 429
W0708 16:43:40.860112 139689314522944 http.py:171] Sleeping 656.62 seconds before retry 10 of 10 for request: POST https://ml.googleapis.com/v1/projects/taglia/jobs?alt=json, after 429
Submitting caliban_franciswi_google_com_1:   0%|                                             | 0/1 [09:46<?, ?requests/s]

I'm not sure what to do to resolve this.

Thanks!

Feature request: Perform a calculation remotely, and return the Docker image

Consider this scenario:

I am developing on a laptop that doesn't have much CPU power. I want to perform a computation (e.g. install a package from source) that is CPU intensive.

Here is an idea: I write a script for this action (e.g. make). Caliban wraps the local state into a container (caliban build), runs the script remotely (caliban submit), and then saves the resulting Docker image. Locally, I can then pull that image and extract the files I need, e.g. the compiled library.

The use case I have in mind is not about building the dependencies (i.e. to externalize building Docker images), but rather to perform actions that I would usually do via caliban shell. This is relevant e.g. for the Einstein Toolkit, or for other packages that contain a large amount of legacy code and which cannot practically be split into dependencies that can be declared in .calibanconfig.json and pre-built.

AWS Backend, similar to `caliban cloud`'s AI Platform support

There's no reason we can't build an AWS backend, to make it easier for users to work with whatever tooling they currently have.

I'm planning on spending some time organizing the codebase to make this easier. The big missing pieces that I'd love help with from the community are:

  • What AWS service is analogous to AI Platform? We want to submit a request with a Docker image ID, command-line arguments, and hardware specs (GPUs, machine type, etc.), and have some AWS service run the job and then stop. Bonus if we can attach labels, etc.
  • What authentication method is similar to Google's Service Account Keys?

We have two auth requirements.

  1. We need to authenticate with AWS to submit the job from the submitting machine.
  2. We'd like to bake some credentials into the container so that users can authenticate with AWS's python library or command line interface and, say, transfer data to and from buckets, or talk to some AWS database.

For (2), Amazon might mount credentials into the container, or we might have to grab them and bake them in, like we do with service account keys.

If someone could comment here with a rough guide (the more detailed the better!) on how to do either of the above two manually, that would be a massive help in automating this.
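One candidate answer to the first question is AWS Batch, which accepts a containerized job with resource requirements and runs it to completion. The sketch below builds the kwargs for boto3's real `batch.submit_job` call; the queue and job-definition names are hypothetical placeholders, and note that in AWS Batch the Docker image lives in the job definition rather than in the submit request:

```python
def batch_submit_payload(job_name, command, vcpus, memory_mib, gpus=0):
    """Build kwargs for boto3's batch.submit_job (AWS Batch SubmitJob API).

    Queue/definition names below are hypothetical; the Docker image itself
    is configured on the job definition, not passed at submit time.
    """
    resources = [
        {"type": "VCPU", "value": str(vcpus)},
        {"type": "MEMORY", "value": str(memory_mib)},
    ]
    if gpus:
        resources.append({"type": "GPU", "value": str(gpus)})
    return {
        "jobName": job_name,
        "jobQueue": "caliban-queue",          # hypothetical queue name
        "jobDefinition": "caliban-job-def",   # hypothetical job definition
        "containerOverrides": {
            "command": command,
            "resourceRequirements": resources,
        },
    }

# With boto3 installed and AWS credentials configured, one would then call:
# import boto3
# boto3.client("batch").submit_job(**batch_submit_payload(
#     "caliban-job-1", ["python", "train.py"], vcpus=8, memory_mib=30000, gpus=1))
```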

KeyError: 'apt_packages' with latest commit

The latest commit seems to have introduced an error in configuration-file handling. I have a project that does not have a .calibanconfig.json file, and caliban shell now reports an error:

(base) MAC0008052:DDF.jl erikschnetter$ caliban shell --nogpu
Traceback (most recent call last):
  File "/Users/erikschnetter/opt/anaconda3/bin/caliban", line 11, in <module>
    load_entry_point('caliban==0.2.6+9.gf59ace0', 'console_scripts', 'caliban')()
  File "/Users/erikschnetter/opt/anaconda3/lib/python3.7/site-packages/caliban-0.2.6+9.gf59ace0-py3.7.egg/caliban/main.py", line 164, in main
  File "/Users/erikschnetter/opt/anaconda3/lib/python3.7/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/Users/erikschnetter/opt/anaconda3/lib/python3.7/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "/Users/erikschnetter/opt/anaconda3/lib/python3.7/site-packages/caliban-0.2.6+9.gf59ace0-py3.7.egg/caliban/main.py", line 74, in run_app
  File "/Users/erikschnetter/opt/anaconda3/lib/python3.7/site-packages/caliban-0.2.6+9.gf59ace0-py3.7.egg/caliban/platform/shell.py", line 113, in run_interactive
  File "/Users/erikschnetter/opt/anaconda3/lib/python3.7/site-packages/caliban-0.2.6+9.gf59ace0-py3.7.egg/caliban/platform/run.py", line 295, in run
  File "/Users/erikschnetter/opt/anaconda3/lib/python3.7/site-packages/caliban-0.2.6+9.gf59ace0-py3.7.egg/caliban/docker/build.py", line 628, in build_image
  File "/Users/erikschnetter/opt/anaconda3/lib/python3.7/site-packages/caliban-0.2.6+9.gf59ace0-py3.7.egg/caliban/docker/build.py", line 583, in _dockerfile_template
  File "/Users/erikschnetter/opt/anaconda3/lib/python3.7/site-packages/caliban-0.2.6+9.gf59ace0-py3.7.egg/caliban/config/__init__.py", line 164, in apt_packages
KeyError: 'apt_packages'

I believe the problem is that caliban_config creates an empty dictionary, and apt_packages later tries to access the key "apt_packages" in there.
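A minimal sketch of the suspected fix: default to an empty list instead of indexing 'apt_packages' directly on an empty config. The function name is illustrative rather than the actual caliban internals; the list-vs-dict shapes follow the documented .calibanconfig.json formats:

```python
def apt_packages(caliban_config: dict, mode: str = "cpu"):
    """Return apt packages from a parsed .calibanconfig.json, tolerating
    a missing key (the KeyError above) by falling back to an empty list."""
    entry = caliban_config.get("apt_packages", [])
    if isinstance(entry, dict):   # {"gpu": [...], "cpu": [...]} form
        return entry.get(mode, [])
    return entry                  # flat-list form
```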
