Comments (8)
Yeah, makes sense. I will get back to you if I find a solution. Thank you!
from 4d-humans.
The PyTorch Lightning estimate of the memory allocation of the whole model is:
637 M Trainable params
0 Non-trainable params
637 M Total params
2,549.510 Total estimated model params size (MB)
Which does make sense, and is around what actually gets allocated on the GPU when training starts (2,706 MB is the actual size).
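As a quick sanity check, the reported size matches a back-of-the-envelope calculation (assuming float32 weights, i.e. 4 bytes per parameter):

```python
# Back-of-the-envelope check of the Lightning summary above:
# 637 M float32 parameters at 4 bytes each is roughly 2.5 GB of weights alone.
num_params = 637_000_000      # approximate, from the "637 M Total params" line
bytes_per_param = 4           # float32
size_mb = num_params * bytes_per_param / 1e6
print(size_mb)                # close to the reported 2,549.510 MB
```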
However, once training actually starts (when training_step begins running), the allocated memory keeps increasing with each iteration. I am still at a loss as to what exactly is causing it. I checked whether there are any appends of non-detached tensors, stopped the logging, and reduced everything else I could, but the issue still persists.
Do you have any advice?
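For what it's worth, the most common cause of this symptom in PyTorch is accumulating loss tensors (e.g. for logging) without detaching them, which keeps every iteration's autograd graph alive. A minimal plain-Python sketch of the pattern — `FakeLoss` and its `graph` field are stand-ins for a tensor and its computation graph, not real PyTorch objects:

```python
class FakeLoss:
    """Stand-in for a loss tensor that holds a reference to its autograd graph."""
    def __init__(self, value, graph):
        self.value = value
        self.graph = graph        # large object that should be freed each step

    def item(self):
        return self.value         # like tensor.item(): returns a plain float

history_leaky, history_ok = [], []
for step in range(3):
    loss = FakeLoss(0.5 * step, graph=bytearray(10_000_000))
    history_leaky.append(loss)        # keeps every step's graph alive -> memory grows
    history_ok.append(loss.item())    # keeps only a plain float -> constant overhead
```

In real code, replacing `history.append(loss)` with `history.append(loss.item())` (or `loss.detach()`) lets PyTorch free each iteration's graph.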
Hi @DavidBoja, any luck finding the reason behind the spike? I am facing the same issue.
No, unfortunately I did not.
I switched to other architectures, since the one from this paper needs a lot of compute power (GPU), and I believe it achieves good results primarily because of the huge amount of data it is trained on.
I wish you luck. Let me know if you manage to find a solution please :).
Hi @DavidBoja, are you working on the 3D human reconstruction problem? Also, which architecture are you using currently?
Hi @mlkorra
I'm more focused on 3D data, rather than 2D data, but I'm interested in guided transformers like these, and non-learning NNs like these.
Not sure of the exact setting you are working with, but one thing we observed that can create GPU memory issues is setting the number of workers higher than what is actually available on the training machine. In that case, decreasing the value avoided the GPU memory increase.
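If it helps anyone reading later, a small sketch of that fix — capping the requested worker count at what the machine actually has. `safe_num_workers` is a hypothetical helper, not part of the repo; it assumes a PyTorch-style `num_workers` argument:

```python
import os

def safe_num_workers(requested):
    """Cap a requested DataLoader worker count at the machine's CPU count."""
    available = os.cpu_count() or 1   # os.cpu_count() can return None
    return max(0, min(requested, available))

# e.g. DataLoader(dataset, num_workers=safe_num_workers(16), ...)
```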
Hi @geopavlakos,
Thanks for the help. I am working with two 12 GB NVIDIA cards. I can run the demo successfully, but I face issues with training. I tried lowering the number of workers and the batch size, and even reduced the SMPL_HEAD depth and number of heads to only 2, but unfortunately the issue still persists.
I don't think the number of workers is related to the issue I'm facing (GPU memory that keeps increasing), because, as I understand it, the workers prepare the dataset examples that will be batched in a training iteration, but those are only transferred to the GPU once the training loop runs.
On the other hand, I have never used PyTorch Lightning before, so maybe issues are arising from there.
In the meantime I have switched to other work, so I'm not actively experimenting with the network; maybe @jerriebright can share more input regarding the issue he is facing, if he is still working on it, or if he has found a solution :).