Comments (7)
Thanks for the report, @dthiagarajan! It looks like there are two issues occurring here.
The first, simpler one is that you've discovered a bug in the way we were handling the stdout dance when we shell out to build the Docker container. I wasn't explicitly flushing stdout, which caused the build steps to appear at the end of your `caliban_run.log`. I also, as you can see, needed to implement a `close()` method on `TqdmFile`. That's all covered in #30, and we should have a new release out today.
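For readers curious what the flushing fix amounts to, here is a minimal sketch. It is not Caliban's actual code: `stream_command` is a hypothetical helper that echoes a subprocess's output and flushes after every write, so lines land in a log in the order they were produced rather than when a buffer happens to drain.

```python
import subprocess
import sys


def stream_command(cmd, out=sys.stdout):
    """Run `cmd`, echoing its combined stdout/stderr to `out` line by line.

    Hypothetical helper for illustration. Returns the process's exit code.
    """
    with subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # interleave stderr with stdout
        text=True,
    ) as proc:
        for line in proc.stdout:
            out.write(line)
            # Without an explicit flush, text can sit in a buffer and show
            # up late (or out of order) in a redirected log file.
            out.flush()
    # Leaving the `with` block waits for the process to finish.
    return proc.returncode
```

Any object with `write` and `flush` methods can be passed as `out`, for example an open log file.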
But that's not what's causing the problem in your training job. Looking around a bit it seems that "return code 137" is Docker's way of signaling that it's run out of memory (moby/moby#21083, as an example).
I think this may be a Mac-only problem, and solvable this way:
"Repeating what's said above, this happens on OSX because of Docker 4 Mac's hard memory cap. You can increase your memory limit in Docker App > Preferences > Advanced."
On a Mac, you can click the "Docker Desktop" menu in the menu bar, click "Preferences" and increase the available memory in the "Resources" tab.
I think this is going to be the cleanest solution. I'll poke around and see if there is some setting we can enable by default that will allow Docker to access more memory, or at least catch this error and make it clearer to the user what's going on.
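As an illustration of what catching this error could look like, the sketch below maps an exit code to a hint. It relies on the common convention that an exit code above 128 means the process was ended by signal `code - 128`, so 137 corresponds to signal 9 (SIGKILL). `describe_exit_code` is a hypothetical name, not a Caliban function, and SIGKILL can have causes other than memory pressure, so the message is only a suggestion.

```python
import signal

SIGKILL_NUMBER = 9  # 137 - 128; hard-coded so the sketch also imports on Windows


def describe_exit_code(code):
    """Return a human-readable hint for a container exit code (hypothetical helper)."""
    if code == 0:
        return "succeeded"
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number unknown on this platform.
            name = f"signal {signum}"
        hint = f"killed by {name}"
        if signum == SIGKILL_NUMBER:
            hint += "; possibly out of memory, try raising Docker's memory limit"
        return hint
    return f"failed with exit code {code}"
```

Calling `describe_exit_code(137)` yields a message that mentions the memory limit, which is the kind of hint a user hitting this problem would need.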
Please let me know if this helps and gets you unblocked! Thanks again for the report, @dthiagarajan , and for testing out Caliban.
from caliban.
This was a world-class bug report, by the way! Thanks for the care it took to write.
@dthiagarajan The details here are:
- `tqdm` uses carriage returns, like `\r`, to rewrite the current line. Python doesn't pass those through without some work when you're running another Python job in a subprocess.
- Python buffers its output, which is a mess here, because `tqdm` uses both `stdout` and `stderr` to write its outputs.
- Docker doesn't have a `COLUMNS` or `LINES` variable internally when you run a container in non-interactive mode!
#31 tackles each of these. It's not perfect. I suspect that if you nest progress bars you may run into trouble, but maybe not. If you have a `tqdm` process and write a bunch of output inside the loop, that might trigger a newline as well.
But this solves most of the issues we'd seen, and I think you'll be happier with the result for sure.
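The `docker run` command logged elsewhere in this thread passes `-ePYTHONUNBUFFERED=1 -e COLUMNS=254 -e LINES=25`. A minimal sketch of how such flags could be assembled follows. The helper names are hypothetical and this is not the code from #31; it assumes `shutil.get_terminal_size` is an acceptable way to read the host terminal's dimensions.

```python
import shutil


def docker_env_flags(fallback=(80, 24)):
    """Build `-e KEY=VALUE` flags for `docker run` (hypothetical helper).

    PYTHONUNBUFFERED turns off Python's output buffering inside the container;
    COLUMNS and LINES tell programs there how large the host terminal is.
    """
    # get_terminal_size consults the COLUMNS/LINES environment variables
    # first, then the attached terminal, then the fallback.
    size = shutil.get_terminal_size(fallback=fallback)
    env = {
        "PYTHONUNBUFFERED": "1",
        "COLUMNS": str(size.columns),
        "LINES": str(size.lines),
    }
    flags = []
    for key, value in env.items():
        flags += ["-e", f"{key}={value}"]
    return flags


def docker_run_command(image_id):
    """Return an argument list suitable for subprocess.run (illustrative only)."""
    return ["docker", "run", *docker_env_flags(), image_id]
```

Returning a list rather than a single string avoids shell quoting issues when the command is handed to `subprocess`.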
Ah, I hadn't noticed that error code - that seems to fix the memory issue, thanks!
On another note (and more nitpicky), I'm seeing something like the following with the `tqdm` progress bar updating when running with Caliban:
Training: 0it [00:00, ?it/s]
Training: 0%| | 0/2 [00:00<?, ?it/s]
Epoch 1: 0%| | 0/2 [00:00<?, ?it/s]
Epoch 1: 50%|█████ | 1/2 [00:03<00:03, 3.29s/it]
Epoch 1: 50%|█████ | 1/2 [00:03<00:03, 3.29s/it, loss=3.435, v_num=5]
Epoch 1: 100%|██████████| 2/2 [00:04<00:00, 2.22s/it, loss=3.435, v_num=5]
Epoch 1: 100%|██████████| 2/2 [00:04<00:00, 2.27s/it, loss=3.435, v_num=5]
Executing: 0%| | 0/1 [00:11<?, ?experiment/s]
whereas when I run locally, I see the following:
Epoch 1: 100%|██████████████████████████████████████████████████████████████| 2/2 [00:20<00:00, 10.12s/it, loss=3.306, v_num=5]
Do I need to specify something when I'm logging in my script? I'm wondering 1) why the progress bar is much longer in the latter compared to the former and 2) why it's logging duplicates.
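One plausible reading of the duplicated lines, following the carriage-return explanation given in this thread: if a parent process reads a child's output in text mode, Python's universal-newline handling turns each `\r` into `\n`, so every redraw of the bar becomes its own line. The sketch below demonstrates that behavior in isolation. It is an assumption that this is the exact mechanism in Caliban, and `capture` is a hypothetical helper.

```python
import subprocess
import sys

# Child program that redraws one line three times using carriage returns,
# roughly the way a progress bar does.
CHILD = r"import sys; sys.stdout.write('10%\r50%\r100%\n')"


def capture(text_mode):
    """Run the child and return its stdout as str (text mode) or bytes."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD],
        stdout=subprocess.PIPE,
        text=text_mode,
        check=True,
    )
    return result.stdout


as_text = capture(True)    # carriage returns are translated to newlines
as_bytes = capture(False)  # carriage returns are preserved
```

Reading the pipe as bytes and forwarding them unchanged is one way to keep the redraw behavior intact.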
@dthiagarajan, I knew that this was a problem and I'd tried to fix it before and failed... but you've successfully motivated me to tackle the issue. Progress bars are too awesome to have to give up inside Caliban jobs. (Especially when I'm using `tqdm` myself to show how many jobs you've completed!)
I've solved this problem in #31. Once I get this merged today, I'll release 0.2.6 and let you know here on this ticket.
Incidentally, it makes our tutorial much prettier!
Thanks again for the nudge.
I0626 09:27:58.373739 4605930944 docker.py:708] Job 1 - Experiment args: []
I0626 09:27:58.374180 4605930944 docker.py:810] Running command: docker run --ipc host -ePYTHONUNBUFFERED=1 -e COLUMNS=254 -e LINES=25 d74bc27fcf48
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
11493376/11490434 [==============================] - 2s 0us/step
2020-06-26 15:28:03.516755: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-06-26 15:28:03.522778: I tensorflow/core/platform/profile_utils/cpu_utils.cc:102] CPU Frequency: 2592000000 Hz
2020-06-26 15:28:03.524190: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55b5bad7c3c0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-06-26 15:28:03.524249: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
Training model with learning rate=0.1 for 3 epochs.
Epoch 1/3
1875/1875 [==============================] - 4s 2ms/step - loss: 2.0499 - accuracy: 0.2110
Epoch 2/3
1875/1875 [==============================] - 5s 2ms/step - loss: 1.9334 - accuracy: 0.1909
Epoch 3/3
1875/1875 [==============================] - 6s 3ms/step - loss: 1.9116 - accuracy: 0.1931
Model performance:
313/313 - 0s - loss: 1.8834 - accuracy: 0.2098
I0626 09:28:20.245232 4605930944 docker.py:744] Job 1 succeeded!
Executing: 100%|########################################################################################################################################################################################################| 1/1 [00:21<00:00, 21.88s/experiment]
Okay @dthiagarajan, I've just cut the 0.2.6 release with these changes: https://github.com/google/caliban/releases/tag/0.2.6
The build should finish shortly and deploy this to PyPI. Upgrade with:
pip install -U caliban
and please let us know if this fixes the issue. I'm going to go ahead and close this now, but feel free to re-open if you run into trouble. Thank you!
Awesome work @sritchie, thank you so much!