caliban's Issues

Docker image is rebuilt for every `cluster job submit`

I notice that Docker rebuilds the image for every cluster job submit, and pushes the respective changes to GCP. There is exactly one large layer that is pushed. I assume this is the working directory that is copied into the image. This is annoying since my working directory is quite large (it has a large executable), and I would like to speed this up.

I am not making any changes to the working directory, and I don't know what triggers the rebuild. I assume these are either spurious changes (Caliban/Docker doesn't detect that nothing actually changed), or perhaps a timestamp is being needlessly updated.

(It would be convenient if caliban build had an option --dry_run that would output the Dockerfile that it generates.)
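
Docker invalidates the single copied-workdir layer whenever anything in the build context changes, so one thing worth checking is whether stray files (logs, editor backups, output directories) are churning between submissions. A minimal sketch of a .dockerignore at the project root that keeps such files out of the build context; the entries are hypothetical examples, to be adjusted to the real layout:

    # .dockerignore (sketch; entries are examples)
    *.log
    output/
    .git
    # only if the large executable does not need to be baked into the image:
    build/my_large_executable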

Distributed training

How is it possible to run Caliban with distributed training? Something like running `torchrun ...` locally.
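
For context, a minimal example of the kind of local invocation the question refers to (plain PyTorch, outside caliban; the script name and process count are placeholders):

    # single-node data-parallel launch with torchrun (PyTorch >= 1.10)
    torchrun --standalone --nproc_per_node=4 train.py --epochs 10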

Make caliban fall back to cpu-only gracefully for local or shell commands

Currently, if a user runs caliban shell on a Mac they get this message:

'caliban shell' doesn't support GPU usage on Macs! Please pass --nogpu to use this command.

If they do so on a Linux machine without a GPU they get:

...
I1029 12:21:05.082031 139972099938112 run.py:335] Running command: docker run --runtime nvidia --ipc host -w /usr/app -u 102880:89939 -v /google/src/cloud/danielfurrer/xcloud/google3/learning/brain/frameworks/xcloud/examples/huggingface:/usr/app -it --entrypoint /bin/bash -v /usr/local/google/home/danielfurrer:/home/danielfurrer 5cc9af84e5cd
docker: Error response from daemon: Unknown runtime specified nvidia.

Would it be reasonable to just detect what runtimes are available and fall back to the behavior of --nogpu (perhaps with a warning)?
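
A rough sketch of how that detection could look (illustrative only, not caliban's actual code): ask the Docker daemon which runtimes it exposes and treat a missing nvidia runtime like --nogpu.

    # sketch: fall back to CPU when the nvidia Docker runtime is unavailable
    import json
    import subprocess

    def nvidia_runtime_available() -> bool:
        out = subprocess.run(
            ["docker", "info", "--format", "{{json .Runtimes}}"],
            capture_output=True, text=True, check=True).stdout
        return "nvidia" in json.loads(out)

    # requested_gpu is a hypothetical flag standing in for "user did not pass --nogpu"
    use_gpu = requested_gpu and nvidia_runtime_available()
    if requested_gpu and not use_gpu:
        print("WARNING: nvidia runtime not found; falling back to --nogpu behavior")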

Feature request: Perform a calculation remotely, and return the Docker image

Consider this scenario:

I am developing on a laptop that doesn't have much CPU power. I want to perform a computation (e.g. install a package from source) that is CPU intensive.

Here is an idea: I write a script for this action (e.g. make). Caliban wraps the local state into a container (caliban build), runs the script remotely (caliban submit), and then saves the resulting Docker image. Locally, I can then pull that image and extract the files I need, e.g. the compiled library.

The use case I have in mind is not about building the dependencies (i.e. to externalize building Docker images), but rather to perform actions that I would usually do via caliban shell. This is relevant e.g. for the Einstein Toolkit, or for other packages that contain a large amount of legacy code and which cannot practically be split into dependencies that can be declared in .calibanconfig.json and pre-built.
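
A sketch of the "retrieve the results" half of this idea, assuming the remote run left behind an image tagged gcr.io/my-project/einsteintoolkit:built (a hypothetical tag) and that the artifact path inside it is known: create a container from the image without running it, then copy the files out.

    # pull the finished image and extract a compiled library from it (paths are examples)
    docker pull gcr.io/my-project/einsteintoolkit:built
    id=$(docker create gcr.io/my-project/einsteintoolkit:built)
    docker cp "$id":/usr/app/lib/libcarpetx.so ./lib/
    docker rm "$id"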

"Failed to read the container uri ... Please make sure that CloudML Engine service account has access to it"

Hello,

I am trying out the caliban cloud command in the demo (run locally works fine). It is running into permission issues when reading the container uri (listed below) after uploading the image:

command: caliban cloud --project_id xxx mnist.py -- --learning_rate 0.01

error:

core.py:103] Request for job 'caliban_ddohan_20200722_165145_1' failed! Details:
core.py:104] Field: master_config.image_uri Error: Failed to read the container uri [gcr.io/xxx/185c294caf60:latest]. Please make sure that CloudML Engine service account has access to it

caliban --version: caliban 0.2.6+8.gf95b955

Regards,
David

Issue with caliban package with installing using pip

There is an error while installing the caliban package using pip. The error says python setup.py egg_info did not run successfully. I am attaching the command-line output here.

Collecting caliban
  Using cached caliban-0.4.1-py3-none-any.whl (157 kB)
Collecting absl-py
  Using cached absl_py-1.2.0-py3-none-any.whl (123 kB)
Collecting google-auth>=1.19.0
  Using cached google_auth-2.11.0-py2.py3-none-any.whl (167 kB)
Collecting google-cloud-container>=0.3.0
  Downloading google_cloud_container-2.11.2-py2.py3-none-any.whl (202 kB)
     ---------------------------------------- 202.8/202.8 kB 2.4 MB/s eta 0:00:00
Collecting lark-parser<0.8.0,>=0.7.1
  Downloading lark-parser-0.7.8.tar.gz (276 kB)
     ---------------------------------------- 276.2/276.2 kB 2.8 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Collecting blessings
  Downloading blessings-1.7-py3-none-any.whl (18 kB)
Collecting yaspin>=0.16.0
  Using cached yaspin-2.2.0-py3-none-any.whl (18 kB)
Collecting kubernetes>=10.0.1
  Using cached kubernetes-24.2.0-py2.py3-none-any.whl (1.5 MB)
Collecting commentjson
  Downloading commentjson-0.9.0.tar.gz (8.7 kB)
  Preparing metadata (setup.py) ... done
Collecting urllib3>=1.25.7
  Downloading urllib3-1.26.12-py2.py3-none-any.whl (140 kB)
     ---------------------------------------- 140.4/140.4 kB 8.1 MB/s eta 0:00:00
Collecting psycopg2-binary==2.8.5
  Using cached psycopg2-binary-2.8.5.tar.gz (381 kB)
  Preparing metadata (setup.py) ... error
  error: subprocess-exited-with-error

  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [25 lines of output]
      C:\Users\Hamza Aziz\AppData\Local\Programs\Python\Python310\lib\site-packages\setuptools\config\setupcfg.py:463: SetuptoolsDeprecationWarning: The license_file parameter is deprecated, use license_files instead.
        warnings.warn(msg, warning_class)
      running egg_info
      creating C:\Users\Hamza Aziz\AppData\Local\Temp\pip-pip-egg-info-1p1u8oyy\psycopg2_binary.egg-info
      writing C:\Users\Hamza Aziz\AppData\Local\Temp\pip-pip-egg-info-1p1u8oyy\psycopg2_binary.egg-info\PKG-INFO
      writing dependency_links to C:\Users\Hamza Aziz\AppData\Local\Temp\pip-pip-egg-info-1p1u8oyy\psycopg2_binary.egg-info\dependency_links.txt
      writing top-level names to C:\Users\Hamza Aziz\AppData\Local\Temp\pip-pip-egg-info-1p1u8oyy\psycopg2_binary.egg-info\top_level.txt
      writing manifest file 'C:\Users\Hamza Aziz\AppData\Local\Temp\pip-pip-egg-info-1p1u8oyy\psycopg2_binary.egg-info\SOURCES.txt'

      Error: pg_config executable not found.

      pg_config is required to build psycopg2 from source.  Please add the directory
      containing pg_config to the $PATH or specify the full executable path with the
      option:

          python setup.py build_ext --pg-config /path/to/pg_config build ...

      or with the pg_config option in 'setup.cfg'.

      If you prefer to avoid building psycopg2 from source, please install the PyPI
      'psycopg2-binary' package instead.

      For further information please check the 'doc/src/install.rst' file (also at
      <https://www.psycopg.org/docs/install.html>).

      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.

Feature request: Custom base images

I want to install additional software dependencies into an image, but this software is not available via apt, nor is it easy to install. To solve this, I want to add additional Docker stanzas to the image used by Caliban.

One way to do so would be to use a different base image (i.e. DEV_CONTAINER_ROOT); in that image, I would start from gcr.io/blueshift-playground/blueshift and then add my own software. I have put a sample of such a Dockerfile at https://gist.github.com/eschnett/5390892b6d8348ea3be5ca35d95f4990.

Alternatively, one could specify these stanzas in .calibanconfig.json.

I tried the --image_id option, but this doesn't work. It works for caliban shell, but for caliban run it skips packing up the current directory.
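
For reference, a derived base image along the lines of the linked gist might look like the sketch below (the extra package names are placeholders); the open question is how to point caliban at it cleanly.

    # Dockerfile sketch: extend the Blueshift base image with extra system software
    FROM gcr.io/blueshift-playground/blueshift:cpu
    RUN apt-get update && \
        DEBIAN_FRONTEND=noninteractive apt-get install --yes --no-install-recommends \
            gfortran libhdf5-dev && \
        apt-get clean && rm -rf /var/lib/apt/lists/*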

AWS Backend, similar to `caliban cloud`'s AI Platform support

There's no reason we can't build an AWS backend, to make it easier for users to work with whatever tooling they currently have.

I'm planning on spending some time organizing the codebase to make this easier. The big missing pieces that I'd love help with from the community are:

  • what AWS service is analogous to AI Platform? We want to submit a request with a Docker image ID, command-line arguments, and hardware specs (GPUs, machine type, etc.), and have some AWS service run the job and then stop. Bonus if we can attach labels, etc. (see the AWS Batch sketch below)
  • What authentication method is similar to Google's Service Account Key?

We have two auth requirements.

  1. We need to authenticate with AWS to submit the job from the submitting machine.
  2. We'd like to bake some credentials into the container so that users can authenticate with AWS's python library or command line interface and, say, transfer data to and from buckets, or talk to some AWS database.

For (2), Amazon might mount credentials into the container, or we might have to grab them and bake them in, like we do with service account keys.

If someone could comment here with a rough guide (the more detailed the better!) on how to do either of the above two items manually, that would be a massive help in automating this.
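
On item 1: AWS Batch looks like the closest analogue to AI Platform training here; you register a job definition that points at a container image, then submit jobs with overrides for the command and resource requirements, and the service runs them to completion. A hedged boto3 sketch, where the queue and job-definition names are assumptions about pre-created resources:

    # sketch: submit a containerized job to AWS Batch
    import boto3

    batch = boto3.client("batch")
    batch.submit_job(
        jobName="caliban-test-job",          # arbitrary name
        jobQueue="my-gpu-queue",             # assumed: an existing Batch queue
        jobDefinition="my-caliban-job-def",  # assumed: references the Docker image
        containerOverrides={
            "command": ["--learning_rate", "0.01"],
            "resourceRequirements": [{"type": "GPU", "value": "1"}],
        },
    )

On item 2, Batch job definitions can reference an IAM role (jobRoleArn) that the running container assumes automatically, which would avoid baking long-lived credentials into the image; it does not cover the "submit from the laptop" case, which still needs local AWS credentials.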

Upgrade to modern dependencies [project]

  • Migrate container build and push steps to GitHub Actions, or lock in Cloud Build on our new project
  • get the GPU base image working
  • test the new container on a cloud deploy with modern cloud services
  • container registry: still working?

Feature request: support REES

REES is the specification supported by repo2docker. It would be great to support that. In particular, as I understand it (but correct me if I am wrong), caliban currently supports dependencies specified in a requirements.txt file. REES expands this to support dependencies in a number of other formats, including conda environment.yml files and even Dockerfiles. Implementing this would increase the range of things users could do with the software.
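
For concreteness, one of the extra formats REES covers is a conda environment.yml like the sketch below (package names are just examples); repo2docker will also accept a plain Dockerfile.

    # environment.yml (example REES dependency file)
    name: example-env
    channels:
      - conda-forge
    dependencies:
      - python=3.9
      - numpy
      - pip
      - pip:
          - absl-py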

Documentation: Caliban Default Creds

A couple of issues:

  1. gcloud service account credentials are now required (as .caliban_default_creds), but this is not documented

  2. Even when I supply working credentials, gcloud auth inside Docker is not working:

Step 9/18 : RUN gcloud auth activate-service-account --key-file=/.creds/credentials.json &&   git config --global credential.'https://source.developers.google.com'.helper gcloud.sh
 ---> Running in ef0260e778f6
ERROR: (gcloud.auth.activate-service-account) The .json key file is not in a valid format.
The command '/bin/sh -c gcloud auth activate-service-account --key-file=/.creds/credentials.json &&   git config --global credential.'https://source.developers.google.com'.helper gcloud.sh' returned a non-zero code: 1
E0820 10:49:44.640692 4603461056 main.py:165] Docker failed with error code 1.

Confirmation it works external to Docker:

gcloud auth activate-service-account --key-file ...
Activated service account credentials for: [...]
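
For anyone hitting the same error: gcloud auth activate-service-account expects a service-account key JSON with roughly the shape below (values elided); an application-default-credentials file or an OAuth client file will typically be rejected as "not in a valid format".

    {
      "type": "service_account",
      "project_id": "...",
      "private_key_id": "...",
      "private_key": "-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----\n",
      "client_email": "...@<project>.iam.gserviceaccount.com",
      "client_id": "...",
      "token_uri": "https://oauth2.googleapis.com/token"
    }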

Insufficient quota in GCP free trial account

I want to create a cluster. After working around #65, I encounter a quota error because my 12-month free trial account apparently doesn't have enough IP addresses (I have 8, but need 12):

$ caliban cluster create --cluster_name einsteintoolkit-cluster --zone us-central1-a
I0801 23:46:03.689018 4536327616 cli.py:185] creating cluster einsteintoolkit-cluster in project fifth-curve-272318 in us-central1-a...
I0801 23:46:03.689273 4536327616 cli.py:186] please be patient, this may take several minutes
I0801 23:46:03.689368 4536327616 cli.py:188] visit https://console.cloud.google.com/kubernetes/clusters/details/us-central1-a/einsteintoolkit-cluster?project=fifth-curve-272318 to monitor cluster creation progress
W0801 23:46:06.863384 4536327616 http.py:123] Invalid JSON content from response: b'{\n  "error": {\n    "code": 403,\n    "message": "Insufficient regional quota to satisfy request: resource \\"IN_USE_ADDRESSES\\": request requires \'12.0\' and is short \'4.0\'. project has a quota of \'8.0\' with \'8.0\' available. View and manage quotas at https://console.cloud.google.com/iam-admin/quotas?usage=USED&project=fifth-curve-272318.",\n    "status": "PERMISSION_DENIED"\n  }\n}\n'
E0801 23:46:06.868414 4536327616 util.py:68] exception in call <function Cluster.create at 0x7feba0b13dd0>:
<HttpError 403 when requesting https://container.googleapis.com/v1beta1/projects/fifth-curve-272318/zones/us-central1-a/clusters?alt=json returned "Insufficient regional quota to satisfy request: resource "IN_USE_ADDRESSES": request requires '12.0' and is short '4.0'. project has a quota of '8.0' with '8.0' available. View and manage quotas at https://console.cloud.google.com/iam-admin/quotas?usage=USED&project=fifth-curve-272318.">
Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.7/site-packages/caliban-0.3.0+8.gaf9dd99.dirty-py3.7.egg/caliban/platform/gke/util.py", line 65, in wrapper
    response = fn(*args, **kwargs)
  File "/opt/anaconda3/lib/python3.7/site-packages/caliban-0.3.0+8.gaf9dd99.dirty-py3.7.egg/caliban/platform/gke/cluster.py", line 1178, in create
    rsp = request.execute()
  File "/opt/anaconda3/lib/python3.7/site-packages/google_api_python_client-1.10.0-py3.7.egg/googleapiclient/_helpers.py", line 134, in positional_wrapper
    return wrapped(*args, **kwargs)
  File "/opt/anaconda3/lib/python3.7/site-packages/google_api_python_client-1.10.0-py3.7.egg/googleapiclient/http.py", line 907, in execute
    raise HttpError(resp, content, uri=self.uri)
googleapiclient.errors.HttpError: <HttpError 403 when requesting https://container.googleapis.com/v1beta1/projects/fifth-curve-272318/zones/us-central1-a/clusters?alt=json returned "Insufficient regional quota to satisfy request: resource "IN_USE_ADDRESSES": request requires '12.0' and is short '4.0'. project has a quota of '8.0' with '8.0' available. View and manage quotas at https://console.cloud.google.com/iam-admin/quotas?usage=USED&project=fifth-curve-272318.">

Free trial accounts cannot update their quota. Is there a way to request fewer IP addresses?
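
One possible workaround, outside caliban and only a sketch: create a smaller single-zone cluster manually with gcloud so fewer in-use addresses are needed, then submit jobs to that existing cluster. The flags shown are standard gcloud flags; whether caliban is happy with the resulting cluster is untested here.

    # sketch: single-zone cluster with one node instead of caliban's defaults
    gcloud container clusters create einsteintoolkit-cluster \
        --zone us-central1-a --node-locations us-central1-a --num-nodes 1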

ModuleNotFoundError: No module named 'google'

I followed the instructions to install Caliban but running
caliban run --experiment_config config.json test.py
gives the following error:

File "/.resources/caliban_launcher.py", line 23, in
import google.auth
ModuleNotFoundError: No module named 'google'

I also checked and ran pip install google.

Update: after adding google-cloud-storage to requirements.txt, the problem is solved.

Missing newlines in generated Dockerfile when using GCP credentials

I started setting up GCP credentials etc., and am now seeing this error with caliban shell:

$ caliban shell --nogpu --docker_run_args '--volume /Users/eschnett/caliban-simulations:/caliban-simulations'
I0802 13:31:15.145786 4626697664 build.py:645] Running command: docker build --rm -f- /Users/eschnett/src/CarpetX

[...]

Step 8/8 : COPY --chown=501:20 .caliban_adc_creds.json /home/eschnett/.config/gcloud/application_default_credentials.jsonCOPY --chown=501:20 cloud_sql_proxy.py /.resources
COPY failed: stat /var/lib/docker/tmp/docker-builder529025309/home/eschnett/.config/gcloud/application_default_credentials.jsonCOPY: no such file or directory
E0802 13:31:33.397300 4626697664 main.py:165] Docker failed with error code 1.
E0802 13:31:33.397594 4626697664 main.py:166] Original command: docker build --rm -f- /Users/eschnett/src/CarpetX

The Dockerfile is missing a newline, so that two successive COPY statements run together.
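
For reference, the intended Dockerfile output is two separate statements:

    COPY --chown=501:20 .caliban_adc_creds.json /home/eschnett/.config/gcloud/application_default_credentials.json
    COPY --chown=501:20 cloud_sql_proxy.py /.resources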

[JOSS review] community guidelines

For this check-list item:

Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

No. 1 is already in place, but I don't see clear guidelines for no. 2 and 3.

KeyError: 'apt_packages' with latest commit

The latest commit seems to have introduced an error into the handling of configuration files. I have a project that does not have a .calibanconfig.json file, and caliban shell now reports an error:

(base) MAC0008052:DDF.jl erikschnetter$ caliban shell --nogpu
Traceback (most recent call last):
  File "/Users/erikschnetter/opt/anaconda3/bin/caliban", line 11, in <module>
    load_entry_point('caliban==0.2.6+9.gf59ace0', 'console_scripts', 'caliban')()
  File "/Users/erikschnetter/opt/anaconda3/lib/python3.7/site-packages/caliban-0.2.6+9.gf59ace0-py3.7.egg/caliban/main.py", line 164, in main
  File "/Users/erikschnetter/opt/anaconda3/lib/python3.7/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/Users/erikschnetter/opt/anaconda3/lib/python3.7/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "/Users/erikschnetter/opt/anaconda3/lib/python3.7/site-packages/caliban-0.2.6+9.gf59ace0-py3.7.egg/caliban/main.py", line 74, in run_app
  File "/Users/erikschnetter/opt/anaconda3/lib/python3.7/site-packages/caliban-0.2.6+9.gf59ace0-py3.7.egg/caliban/platform/shell.py", line 113, in run_interactive
  File "/Users/erikschnetter/opt/anaconda3/lib/python3.7/site-packages/caliban-0.2.6+9.gf59ace0-py3.7.egg/caliban/platform/run.py", line 295, in run
  File "/Users/erikschnetter/opt/anaconda3/lib/python3.7/site-packages/caliban-0.2.6+9.gf59ace0-py3.7.egg/caliban/docker/build.py", line 628, in build_image
  File "/Users/erikschnetter/opt/anaconda3/lib/python3.7/site-packages/caliban-0.2.6+9.gf59ace0-py3.7.egg/caliban/docker/build.py", line 583, in _dockerfile_template
  File "/Users/erikschnetter/opt/anaconda3/lib/python3.7/site-packages/caliban-0.2.6+9.gf59ace0-py3.7.egg/caliban/config/__init__.py", line 164, in apt_packages
KeyError: 'apt_packages'

I believe the problem is that caliban_config creates an empty dictionary, and apt_packages later tries to access the key "apt_packages" in there.
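
A minimal sketch of the kind of defensive fix implied above (illustrative only; the real function in caliban/config/__init__.py may be structured differently): read the key with a default instead of indexing it directly.

    # sketch: tolerate configs (or empty dicts) with no "apt_packages" entry
    def apt_packages(config: dict, mode: str) -> list:
        pkgs = config.get("apt_packages", []) or []
        if isinstance(pkgs, dict):   # per-mode form, e.g. {"gpu": [...], "cpu": [...]}
            return pkgs.get(mode, [])
        return list(pkgs)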

Google-auth is not installed automatically

In order to run jobs on GCP, google-auth needs to be installed within the caliban venv. It seems like this is not the case, and unless google-auth is specified in the user's requirements.txt (or something that requires google-auth is specified in requirements.txt), jobs are unable to launch.

Abrupt exit when training a model

Hi all,

Thanks for putting this package up! I really love the idea behind it and can't wait to integrate it more tightly with my workflow!

I'm trying to integrate Caliban with one of my smaller projects I'm working on here, but I'm having some trouble getting things to run. I added the requirements.txt file as instructed, but when I run the training script, I don't see any visible error and the process exits abruptly.

I'm using a Mac, and my data is stored at /Users/dilip.thiagarajan/data. Here's exactly what I did:

  • In that repository, I first tried running:
caliban run --nogpu --docker_run_args "--volume /Users/dilip.thiagarajan/data:/data" train.py -- --model_name resnet18 --projection_dim 64 --fast_dev_run True --download --data_dir /data

When I run this from the terminal, I see the following output:

dilip.thiagarajan simclr_pytorch % caliban run --nogpu --docker_run_args "--volume /Users/dilip.thiagarajan/data:/data" train.py -- --model_name resnet18 --projection_dim 64 --fast_dev_run True --download --data_dir /data                    
I0624 22:07:53.246673 4578139584 docker.py:614] Running command: docker build --rm -f- /Users/dilip.thiagarajan/code/simclr_pytorch
Sending build context to Docker daemon  110.6kB

Step 1/11 : FROM gcr.io/blueshift-playground/blueshift:cpu
 ---> fafdb20241ad
Step 2/11 : RUN [ $(getent group 20) ] || groupadd --gid 20 20
 ---> Using cache
 ---> 6b724e6c1e38
Step 3/11 : RUN useradd --no-log-init --no-create-home -u 502 -g 20 --shell /bin/bash dilip.thiagarajan
 ---> Using cache
 ---> 251bdcb68ec9
Step 4/11 : RUN mkdir -m 777 /usr/app /.creds /home/dilip.thiagarajan
 ---> Using cache
 ---> d2952e2052e3
Step 5/11 : ENV HOME=/home/dilip.thiagarajan
 ---> Using cache
 ---> d8c700640045
Step 6/11 : WORKDIR /usr/app
 ---> Using cache
 ---> 8d6fd0c9f3f4
Step 7/11 : USER 502:20
 ---> Using cache
 ---> 293fcdb3733f
Step 8/11 : COPY --chown=502:20 requirements.txt /usr/app
 ---> Using cache
 ---> 9074b050a5de
Step 9/11 : RUN /bin/bash -c "pip install --no-cache-dir -r requirements.txt"
 ---> Using cache
 ---> 60f28d41deb9
Step 10/11 : COPY --chown=502:20 . /usr/app/.
 ---> 74b6d6b6d42f
Step 11/11 : ENTRYPOINT ["python", "train.py"]
 ---> Running in 54a219fe9826
Removing intermediate container 54a219fe9826
 ---> 081b2c362108
Successfully built 081b2c362108
I0624 22:07:54.054889 4578139584 util.py:710] Restoring pure python logging
I0624 22:07:54.057392 4578139584 docker.py:707]                                                                                                                                                
I0624 22:07:54.057760 4578139584 docker.py:708] Job 1 - Experiment args: ['--model_name', 'resnet18', '--projection_dim', '64', '--fast_dev_run', 'True', '--download', '--data_dir', '/data'] 
I0624 22:07:54.057989 4578139584 docker.py:787] Running command: docker run --ipc host --volume /Users/dilip.thiagarajan/data:/data 081b2c362108 --model_name resnet18 --projection_dim 64 --fast_dev_run True --download --data_dir /data
Executing:   0%|                                                                                                                                                 | 0/1 [00:00<?, ?experiment/s]Downloading: "https://download.pytorch.org/models/resnet18-5c106cde.pth" to /home/dilip.thiagarajan/.cache/torch/checkpoints/resnet18-5c106cde.pth
100%|██████████| 44.7M/44.7M [00:00<00:00, 52.8MB/s]
Running in fast_dev_run mode: will run a full train, val and test loop using a single batch
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
/opt/conda/envs/caliban/lib/python3.7/site-packages/pytorch_lightning/utilities/distributed.py:25: RuntimeWarning: You have defined a `val_dataloader()` and have defined a `validation_step()`, you may also want to define `validation_epoch_end()` for accumulating stats.
  warnings.warn(*args, **kwargs)

  | Name            | Type            | Params
----------------------------------------------------
0 | model           | Sequential      | 11 M  
1 | projection_head | Linear          | 32 K  
2 | loss            | NTXEntCriterion | 0     
Files already downloaded and verified                                                                                                                                                          
Files already downloaded and verified                                                                                                                                                          
Training: 0it [00:00, ?it/s]                                                                                                                                                                   
Training:   0%|          | 0/2 [00:00<?, ?it/s]                                                                                                                                                
E0624 22:08:09.984529 4578139584 docker.py:747] Job 1 failed with return code 137.                                                                                                             
E0624 22:08:09.984878 4578139584 docker.py:750] Failing args for job 1: ['--model_name', 'resnet18', '--projection_dim', '64', '--fast_dev_run', 'True', '--download', '--data_dir', '/data']  
Executing: 100%|#########################################################################################################################################| 1/1 [00:15<00:00, 15.93s/experiment]

while when I redirect the output to a log file by running

caliban run --nogpu --docker_run_args "--volume /Users/dilip.thiagarajan/data:/data" train.py -- --model_name resnet18 --projection_dim 64 --fast_dev_run True --download --data_dir /data &> caliban_run.log &

I see the following in my trace:

Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/Users/dilip.thiagarajan/.pyenv/versions/3.7.3/lib/python3.7/logging/__init__.py", line 2039, in shutdown
    h.close()
  File "/Users/dilip.thiagarajan/.pyenv/versions/3.7.3/lib/python3.7/site-packages/absl/logging/__init__.py", line 864, in close
    self.stream.close()
AttributeError: 'TqdmFile' object has no attribute 'close'

Is this a problem with some interaction with logging and tqdm? Or is it something I'm doing that's incorrect when I'm mounting my data directory?

The following works properly for me locally:
python3 train.py --model_name resnet18 --projection_dim 64 --fast_dev_run True --data_dir ~/data --download

Thanks for your help!
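
Side note on the atexit error above: it looks like a tqdm-redirecting stream wrapper that absl's logging handler tries to close() at shutdown, and the wrapper simply lacks that method. An illustrative sketch (not caliban's actual code) of such a wrapper with the missing method added:

    # sketch: a file-like wrapper that routes writes through tqdm; absl calls
    # close() on logging streams at shutdown, so the wrapper needs to provide it
    import sys
    from tqdm import tqdm

    class TqdmFile:
        def __init__(self, stream=sys.stdout):
            self._stream = stream

        def write(self, msg):
            if msg.strip():
                tqdm.write(msg, file=self._stream, end="")

        def flush(self):
            self._stream.flush()

        def close(self):
            # the method the traceback says is missing; a flush is enough here
            self.flush()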

HTTP Error 403: Forbidden

Hi
I have been working with caliban for a while, but now, running exactly the same code, I am getting a 403 error when downloading the MNIST dataset:

Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz to datasets/MNIST/raw/train-images-idx3-ubyte.gz
0it [00:00, ?it/s]Traceback (most recent call last):
  File "mlp.py", line 257, in <module>
    main()
  File "mlp.py", line 187, in main
    train_dataset = load_data('train', args.dataset, args.datadir, nchannels)
  File "mlp.py", line 91, in load_data
    dataset = get_dataset(root=datadir, train=True, download=True, transform=tr_transform)
  File "/opt/conda/envs/caliban/lib/python3.7/site-packages/torchvision/datasets/mnist.py", line 79, in __init__
    self.download()
  File "/opt/conda/envs/caliban/lib/python3.7/site-packages/torchvision/datasets/mnist.py", line 146, in download
    download_and_extract_archive(url, download_root=self.raw_folder, filename=filename, md5=md5)
  File "/opt/conda/envs/caliban/lib/python3.7/site-packages/torchvision/datasets/utils.py", line 256, in download_and_extract_archive
    download_url(url, download_root, filename, md5)
  File "/opt/conda/envs/caliban/lib/python3.7/site-packages/torchvision/datasets/utils.py", line 84, in download_url
    raise e
  File "/opt/conda/envs/caliban/lib/python3.7/site-packages/torchvision/datasets/utils.py", line 72, in download_url
    reporthook=gen_bar_updater()
  File "/opt/conda/envs/caliban/lib/python3.7/urllib/request.py", line 247, in urlretrieve
    with contextlib.closing(urlopen(url, data)) as fp:
  File "/opt/conda/envs/caliban/lib/python3.7/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/opt/conda/envs/caliban/lib/python3.7/urllib/request.py", line 531, in open
    response = meth(req, response)
  File "/opt/conda/envs/caliban/lib/python3.7/urllib/request.py", line 641, in http_response
    'http', request, response, code, msg, hdrs)
  File "/opt/conda/envs/caliban/lib/python3.7/urllib/request.py", line 569, in error
    return self._call_chain(*args)
  File "/opt/conda/envs/caliban/lib/python3.7/urllib/request.py", line 503, in _call_chain
    result = func(*args)
  File "/opt/conda/envs/caliban/lib/python3.7/urllib/request.py", line 649, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden
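
The 403 comes from yann.lecun.com itself (the files are no longer freely downloadable from there), not from caliban. A hedged workaround, assuming the PyTorch-hosted S3 mirror is acceptable and the data directory is datasets/ as in the traceback: pre-fetch the raw archives so torchvision skips the failing download.

    # sketch: fetch the MNIST archives from a mirror before running the script
    mkdir -p datasets/MNIST/raw
    for f in train-images-idx3-ubyte.gz train-labels-idx1-ubyte.gz \
             t10k-images-idx3-ubyte.gz t10k-labels-idx1-ubyte.gz; do
      curl -fsSL "https://ossci-datasets.s3.amazonaws.com/mnist/$f" \
           -o "datasets/MNIST/raw/$f"
    done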

Cannot create cluster

I want to create a GKE cluster following your instructions. I have set up the cloud tools, authentication, etc. I receive this error message:

$ caliban cluster create --cluster_name einsteintoolkit-cluster --zone us-east1-b
I0801 23:07:37.349657 4416302528 cli.py:185] creating cluster einsteintoolkit-cluster in project fifth-curve-272318 in us-east1-b...
I0801 23:07:37.349900 4416302528 cli.py:186] please be patient, this may take several minutes
I0801 23:07:37.349989 4416302528 cli.py:188] visit https://console.cloud.google.com/kubernetes/clusters/details/us-east1-b/einsteintoolkit-cluster?project=fifth-curve-272318 to monitor cluster creation progress
E0801 23:07:37.582320 4416302528 util.py:68] exception in call <function Cluster.create at 0x7ffc08ba4a70>:
<HttpError 400 when requesting https://container.googleapis.com/v1beta1/projects/fifth-curve-272318/zones/us-east1-b/clusters?alt=json returned "Resource_limit.maximum must be greater than 0.">
Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.7/site-packages/caliban-0.3.0+8.gaf9dd99-py3.7.egg/caliban/platform/gke/util.py", line 65, in wrapper
    response = fn(*args, **kwargs)
  File "/opt/anaconda3/lib/python3.7/site-packages/caliban-0.3.0+8.gaf9dd99-py3.7.egg/caliban/platform/gke/cluster.py", line 1178, in create
    rsp = request.execute()
  File "/opt/anaconda3/lib/python3.7/site-packages/google_api_python_client-1.10.0-py3.7.egg/googleapiclient/_helpers.py", line 134, in positional_wrapper
    return wrapped(*args, **kwargs)
  File "/opt/anaconda3/lib/python3.7/site-packages/google_api_python_client-1.10.0-py3.7.egg/googleapiclient/http.py", line 907, in execute
    raise HttpError(resp, content, uri=self.uri)
googleapiclient.errors.HttpError: <HttpError 400 when requesting https://container.googleapis.com/v1beta1/projects/fifth-curve-272318/zones/us-east1-b/clusters?alt=json returned "Resource_limit.maximum must be greater than 0.">

When I use --dry_run, I see these details:

$ caliban cluster create --cluster_name einsteintoolkit-cluster --zone us-east1-b --dry_run
I0801 23:08:27.893903 4585823680 cli.py:175] request:
{'cluster': {'autoscaling': {'autoprovisioningNodePoolDefaults': {'oauthScopes': ['https://www.googleapis.com/auth/compute',
                                                                                  'https://www.googleapis.com/auth/cloud-platform']},
                             'enableNodeAutoprovisioning': 'true',
                             'resourceLimits': [{'maximum': '24',
                                                 'resourceType': 'cpu'},
                                                {'maximum': '1536',
                                                 'resourceType': 'memory'},
                                                {'maximum': '1',
                                                 'resourceType': 'nvidia-tesla-k80'},
                                                {'maximum': '1',
                                                 'resourceType': 'nvidia-tesla-p100'},
                                                {'maximum': '1',
                                                 'resourceType': 'nvidia-tesla-v100'},
                                                {'maximum': '1',
                                                 'resourceType': 'nvidia-tesla-p4'},
                                                {'maximum': '1',
                                                 'resourceType': 'nvidia-tesla-t4'},
                                                {'maximum': '0',
                                                 'resourceType': 'nvidia-tesla-a100'}]},
             'enable_tpu': 'true',
             'ipAllocationPolicy': {'useIpAliases': 'true'},
             'locations': ['us-east1-b', 'us-east1-c', 'us-east1-d'],
             'name': 'einsteintoolkit-cluster',
             'nodePools': [{'config': {'oauthScopes': ['https://www.googleapis.com/auth/devstorage.read_only',
                                                       'https://www.googleapis.com/auth/logging.write',
                                                       'https://www.googleapis.com/auth/monitoring',
                                                       'https://www.googleapis.com/auth/service.management.readonly',
                                                       'https://www.googleapis.com/auth/servicecontrol',
                                                       'https://www.googleapis.com/auth/trace.append']},
                            'initialNodeCount': '3',
                            'name': 'default-pool'}],
             'releaseChannel': {'channel': 'REGULAR'},
             'zone': 'us-east1-b'},
 'parent': 'projects/fifth-curve-272318/locations/us-east1-b'}

There is indeed a resource request with a maximum of 0.

I am using the current master branch.
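
A minimal sketch of the kind of client-side fix the dry-run output suggests (illustrative; the actual caliban code differs): drop accelerator limits whose maximum is 0, such as the nvidia-tesla-a100 entry above, before sending the create request.

    # sketch: filter out zero-maximum resource limits from the cluster request
    def prune_resource_limits(limits):
        return [lim for lim in limits if int(lim["maximum"]) > 0]

    # "request" here stands for the dict shown by --dry_run above
    request["cluster"]["autoscaling"]["resourceLimits"] = prune_resource_limits(
        request["cluster"]["autoscaling"]["resourceLimits"])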

A way to provide my own docker image?

Is there a way to provide a Dockerfile or a reference to a public Docker image instead of relying on requirements.txt, etc.?

Context: I'm working on a repo2docker Action that integrates with GitHub and want to explore automatically launching notebooks from a repo.

Google Cloud seems like a promising place to start! P.S. I found this tool through @rand

Create base image based on Ubuntu 20.04 LTS

I need to use several software packages that are either not available in Ubuntu 18.04 or are outdated there. It would be quite convenient for me if the base image used Ubuntu 20.04 LTS instead of 18.04 LTS.

For reference, I need these packages which are not available in 18.04:

  • cmake >= 3.14
  • gcc >= 8
  • libopenmpi-dev >= 3
  • libpsm2

I can, of course, install these from source, but this is time-consuming every time the image is rebuilt.
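
A hedged sketch of what a 20.04-based CPU image could look like, covering the packages listed above (package names are my assumption about the Ubuntu focal archive; the real Blueshift Dockerfiles are more involved):

    # Dockerfile sketch: Ubuntu 20.04 base with the packages listed above
    FROM ubuntu:20.04
    ENV DEBIAN_FRONTEND=noninteractive
    RUN apt-get update && apt-get install --yes --no-install-recommends \
            python3 python3-pip cmake gcc g++ libopenmpi-dev libpsm2-dev && \
        apt-get clean && rm -rf /var/lib/apt/lists/*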

`caliban cloud`: providing project ID through CLI fails

I am seeing:

$ caliban cloud --nogpu mnist.py -- --project_id landscape-238422


No project_id found. 'caliban cloud' requires that you either set a 
$PROJECT_ID environment variable with the ID of your Cloud project, or pass one 
explicitly via --project_id. Try again, please!

Setting the environment variable does seem to work.
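
One thing worth double-checking (a guess, based on how the other caliban commands in these issues are invoked): everything after the bare -- is forwarded to the script rather than parsed by caliban, so the flag may need to come before the script name, e.g.:

    $ caliban cloud --nogpu --project_id landscape-238422 mnist.py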

Looking for a strangely-named image

Hello! Thanks again for this great project.

Running on a Mac laptop, having installed caliban with pip, I tried the following:

$ caliban notebook --nogpu
I0204 10:58:44.096491 4624354752 build.py:731] Running command: docker build --rm -f- /Users/arokem/tmp/caliban_test
/Users/arokem/miniconda3/envs/caliban/lib/python3.8/subprocess.py:838: RuntimeWarning: line buffering (buffering=1) isn't supported in binary mode, the default buffer size will be used
  self.stdin = io.open(p2cwrite, 'wb', bufsize)
/Users/arokem/miniconda3/envs/caliban/lib/python3.8/subprocess.py:844: RuntimeWarning: line buffering (buffering=1) isn't supported in binary mode, the default buffer size will be used
  self.stdout = io.open(c2pread, 'rb', bufsize)
#1 [internal] load build definition from Dockerfile
#1 sha256:f28aea6def1a6cd99c3dba9a8588148c15a44dbfe43b481d262a0028d8dda975
#1 transferring dockerfile: 1.08kB 0.0s done
#1 DONE 0.0s

#2 [internal] load .dockerignore
#2 sha256:6dc01f396991620f90d338b881b4f6b1f6f0c9d6c01dfad942b60c6ef69159d7
#2 transferring context: 2B done
#2 DONE 0.0s

#3 [internal] load metadata for gcr.io/blueshift-playground/blueshift:cpu-ubuntu1804-py37
#3 sha256:25c05f5af803364da95cc3e34f6fd1b1fed810090a16c60858c3d711f73887c1
#3 DONE 0.3s

#4 [ 1/12] FROM gcr.io/blueshift-playground/blueshift:cpu-ubuntu1804-py37@sha256:ba03f280085e923a6be72af7d697cf209a17506cbe1963a927fc8633031c7fb0
#4 sha256:afa8668d82673191feeaa31b8753b5b616f1f9c9dbcb24c3e59516d900a74ef9
#4 DONE 0.0s

#10 [internal] load build context
#10 sha256:a35234ecc058bfe64aab295bb66df4d3bde95af86b97423e1400e05f86b78f8a
#10 transferring context: 248B done
#10 DONE 0.0s

#8 [ 5/12] WORKDIR /usr/app
#8 sha256:2018ebc767a5256774e0043b6c3a5b77d0fd9228fe9d1c25218eed5654b31f47
#8 CACHED

#5 [ 2/12] RUN [ $(getent group 20) ] || groupadd --gid 20 20
#5 sha256:83a3c5bf3f1567625775b104b115a569214db23e58cea0b09f683c85dfd0c12e
#5 CACHED

#9 [ 6/12] RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install --yes --no-install-recommends zsh && apt-get clean && rm -rf /var/lib/apt/lists/*
#9 sha256:152bc68698ec58a809ee62324e32d4a31182d396774266aadc4a704749cd0aaa
#9 CACHED

#7 [ 4/12] RUN mkdir -m 777 /usr/app /.creds /.resources /home/arokem
#7 sha256:ace23be941af73ca9b62052532466d06c8ff2656ebf85fb5107f3680461b7771
#7 CACHED

#13 [ 9/12] RUN pip install jupyter
#13 sha256:80d93a11752a3233671d7bef53df010723558d69cf6f3e03fa2369dce1c11247
#13 CACHED

#6 [ 3/12] RUN useradd --no-log-init --no-create-home -u 501 -g 20 --shell /bin/bash arokem
#6 sha256:8c0e7c3001be7c1c7ad87536bcfc588350fc5ab55a04b4f62370cf55249a3548
#6 CACHED

#14 [10/12] COPY --chown=501:20 cloud_sql_proxy.py /.resources
#14 sha256:10d002a7044af1e1cea195fed2a0ac90c7468733f5433146334f03f3097899d0
#14 CACHED

#11 [ 7/12] COPY --chown=501:20 requirements.txt /usr/app
#11 sha256:3d1c7cfc1fa38374f278e323ba6f19471d578bc0faaf13eb5f09cb6a28c809e6
#11 CACHED

#12 [ 8/12] RUN /bin/bash -c "pip install --no-cache-dir -r requirements.txt"
#12 sha256:ea84f2faf52c42eca47379f73b7718c2aaa561a1c60b21cedf5c97d0bdc5e6cb
#12 CACHED

#15 [11/12] COPY --chown=501:20 caliban_launcher.py /.resources
#15 sha256:13f784a9b4ae0df920d8e9b1a22d410508e4bae77af6693cd79d11e3d041a330
#15 CACHED

#16 [12/12] COPY --chown=501:20 ./caliban_launcher_cfg.json /.resources
#16 sha256:c9c4c3a49dee7f24c33b75dd28282056cfb21bb64885626da62e4bb90c55cf3b
#16 CACHED

#17 exporting to image
#17 sha256:e8c613e07b0b7ff33893b694f7759a10d42e180f2b4dc349fb57dc6b71dcab00
#17 exporting layers done
#17 writing image sha256:a06b3ff98fd03ae55ff165572f5a29559ebc8d16b0e78cf46986801d573cbdc8 done
#17 DONE 0.0s
I0204 10:58:44.979649 4624354752 run.py:335] Running command: docker run --ipc host -w /usr/app -u 501:20 -v /Users/arokem/tmp/caliban_test:/usr/app -it --entrypoint python -v /Users/arokem:/home/arokem -p 8889:8889 0.0s -m jupyter notebook --ip=0.0.0.0 --port=8889 --no-browser
Unable to find image '0.0s:latest' locally
docker: Error response from daemon: pull access denied for 0.0s, repository does not exist or may require 'docker login': denied: requested access to the resource is denied.
See 'docker run --help'.

Looks like maybe the docker command is malformed? Why is that 0.0s in there?
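
The 0.0s lands exactly where the image id should be, which suggests the BuildKit-style output above is being scraped for the id. A workaround worth trying (hedged; it simply switches docker back to the classic output format that other issues in this list show being parsed successfully):

    $ DOCKER_BUILDKIT=0 caliban notebook --nogpu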

Timeout when submitting request to deploy job

I'm running caliban to deploy a model to Google Cloud as follows:

caliban cloud --project 'taglia' --region 'us-central1' --machine_type 'n1-standard-32' --gpu_spec '4xV100' train_panoptic.py -- <TRAIN SCRIPT ARGS>

I am getting stuck with the following error:

I0708 16:33:54.308651 139689314522944 core.py:324] Submitting request!                                                   
W0708 16:33:54.516765 139689314522944 http.py:171] Sleeping 1.11 seconds before retry 1 of 10 for request: POST https://ml.googleapis.com/v1/projects/taglia/jobs?alt=json, after 429
W0708 16:33:55.784710 139689314522944 http.py:171] Sleeping 2.46 seconds before retry 2 of 10 for request: POST https://ml.googleapis.com/v1/projects/taglia/jobs?alt=json, after 429
W0708 16:33:58.420140 139689314522944 http.py:171] Sleeping 5.17 seconds before retry 3 of 10 for request: POST https://ml.googleapis.com/v1/projects/taglia/jobs?alt=json, after 429
W0708 16:34:03.754786 139689314522944 http.py:171] Sleeping 1.37 seconds before retry 4 of 10 for request: POST https://ml.googleapis.com/v1/projects/taglia/jobs?alt=json, after 429
W0708 16:34:05.324873 139689314522944 http.py:171] Sleeping 5.63 seconds before retry 5 of 10 for request: POST https://ml.googleapis.com/v1/projects/taglia/jobs?alt=json, after 429
W0708 16:34:11.127667 139689314522944 http.py:171] Sleeping 13.70 seconds before retry 6 of 10 for request: POST https://ml.googleapis.com/v1/projects/taglia/jobs?alt=json, after 429
W0708 16:34:25.041211 139689314522944 http.py:171] Sleeping 124.27 seconds before retry 7 of 10 for request: POST https://ml.googleapis.com/v1/projects/taglia/jobs?alt=json, after 429
W0708 16:36:29.560709 139689314522944 http.py:171] Sleeping 79.67 seconds before retry 8 of 10 for request: POST https://ml.googleapis.com/v1/projects/taglia/jobs?alt=json, after 429
W0708 16:37:49.466338 139689314522944 http.py:171] Sleeping 351.15 seconds before retry 9 of 10 for request: POST https://ml.googleapis.com/v1/projects/taglia/jobs?alt=json, after 429
W0708 16:43:40.860112 139689314522944 http.py:171] Sleeping 656.62 seconds before retry 10 of 10 for request: POST https://ml.googleapis.com/v1/projects/taglia/jobs?alt=json, after 429
Submitting caliban_franciswi_google_com_1:   0%|                                             | 0/1 [09:46<?, ?requests/s]

I'm not sure what to do to resolve this.

Thanks!
