The dreamboothonray from gjoliver

Performance issues

Hi!

I am trying to scale Dreambooth training with Ray using your script, but I have run into a few issues that I would like to discuss.
First, I cannot change the batch size from 1. If I set it to 2, I get this error:

File "train.py", line 149, in train_fn
latents = vae.encode(batch["images"]).latent_dist.sample() * 0.18215
File "/home/ubuntu/ray-sample/ray/python/ray/air/examples/dreambooth/venv/lib/python3.8/site-packages/diffusers/models/vae.py", line 566, in encode
h = self.encoder(x)
File "/home/ubuntu/ray-sample/ray/python/ray/air/examples/dreambooth/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ubuntu/ray-sample/ray/python/ray/air/examples/dreambooth/venv/lib/python3.8/site-packages/diffusers/models/vae.py", line 130, in forward
sample = self.conv_in(sample)
File "/home/ubuntu/ray-sample/ray/python/ray/air/examples/dreambooth/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ubuntu/ray-sample/ray/python/ray/air/examples/dreambooth/venv/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 463, in forward
return self._conv_forward(input, self.weight, self.bias)
File "/home/ubuntu/ray-sample/ray/python/ray/air/examples/dreambooth/venv/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 459, in _conv_forward
return F.conv2d(input, weight, bias, self.stride,
RuntimeError: Expected 3D (unbatched) or 4D (batched) input to conv2d, but got input of size: [2, 2, 3, 512, 512]
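
A minimal sketch of one possible workaround, assuming the 5D shape [2, 2, 3, 512, 512] means the images already carry a leading batch-like dimension and the data loader stacks a second one on top. That reading is a guess and is not confirmed by the script. The helper name `collapse_extra_batch_dim` is hypothetical, and it only relies on `.shape` and `.reshape`, which both PyTorch tensors and NumPy arrays provide.

```python
def collapse_extra_batch_dim(images):
    """Merge all leading dimensions so the result is (N, C, H, W).

    Works on any array-like object exposing `.shape` and `.reshape`,
    such as a PyTorch tensor or a NumPy array.
    """
    shape = tuple(images.shape)
    if len(shape) <= 4:
        # Already 3D (unbatched) or 4D (batched): nothing to do.
        return images
    channels, height, width = shape[-3:]
    # -1 lets the library infer N as the product of the leading dimensions,
    # e.g. [2, 2, 3, 512, 512] -> [4, 3, 512, 512].
    return images.reshape((-1, channels, height, width))
```

It would be applied to `batch["images"]` before the `vae.encode(...)` call. Note that this turns the example above into an effective batch of 4 images, so any other per-sample entries in the batch would need the same treatment to stay aligned.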

Second, why do you split the model across 2 GPUs? I am using the same g5.12xlarge instances and I am able to load everything on 1 GPU.
Third, performance. I see a huge slowdown when I run across different nodes. Here are some results:

1 worker, 1 instance, 1 GPU -> 1.8 it/s
4 workers, 1 instance, 4 GPUs -> 3.5 it/s
2 workers, 2 instances, 2 GPUs -> 0.5 it/s

From my debugging, loss.backward() takes 9 seconds to complete in the 2-instance configuration. With 4 workers on 1 instance it takes 0.3 seconds, and with 1 worker it is negligible.

Also, when running across instances, network usage on those instances spikes to over 500 MB/s up and down.
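
A rough back-of-the-envelope sketch of how much gradient traffic a data-parallel step could generate, to compare against the 500 MB/s figure. Everything except that bandwidth is an assumption: the parameter count (about the size of a Stable Diffusion v1 UNet), fp32 gradients, and a ring all-reduce in which each worker sends 2 * (N - 1) / N times the gradient size per step. The collective actually used by the backend may differ.

```python
def ring_allreduce_bytes_per_worker(num_params, bytes_per_param, num_workers):
    """Bytes each worker sends in one ring all-reduce of all gradients."""
    payload = num_params * bytes_per_param
    return 2 * (num_workers - 1) / num_workers * payload


def transfer_seconds(num_bytes, bandwidth_bytes_per_s):
    """Time to push num_bytes through a link at the given bandwidth."""
    return num_bytes / bandwidth_bytes_per_s


ASSUMED_NUM_PARAMS = 860_000_000  # assumption: roughly a Stable Diffusion v1 UNet
ASSUMED_BYTES_PER_PARAM = 4       # assumption: gradients exchanged in fp32
OBSERVED_BANDWIDTH = 500e6        # from the post, read as 500 * 10**6 bytes/s

sent = ring_allreduce_bytes_per_worker(
    ASSUMED_NUM_PARAMS, ASSUMED_BYTES_PER_PARAM, num_workers=2
)
seconds = transfer_seconds(sent, OBSERVED_BANDWIDTH)
print(f"{sent / 1e9:.2f} GB per worker per step, about {seconds:.1f} s at 500 MB/s")
```

Under these assumptions each worker sends about 3.4 GB per step, which takes close to 7 seconds at 500 MB/s. That is the same order of magnitude as the 9 seconds measured for loss.backward(), so the inter-node link is at least a plausible bottleneck.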

What could be the root cause of this extremely poor performance?

Thank you!!
