Comments (8)
Yeah, makes sense. I will get back to you if I find a solution. Thank you!
from 4d-humans.
The PyTorch Lightning estimate of the memory allocation of the whole model is:
637 M Trainable params
0 Non-trainable params
637 M Total params
2,549.510 Total estimated model params size (MB)
Which does make sense, and is around what actually gets allocated on the GPU when training starts (2,706 MB is the actual size).
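As a quick sanity check, the reported size matches a back-of-the-envelope calculation (assuming float32 weights, i.e. 4 bytes per parameter):

```python
# Back-of-the-envelope check of the Lightning summary above:
# 637 M float32 parameters at 4 bytes each is roughly 2.5 GB of weights alone.
num_params = 637_000_000      # approximate, from the "637 M Total params" line
bytes_per_param = 4           # float32
size_mb = num_params * bytes_per_param / 1e6
print(size_mb)                # close to the reported 2,549.510 MB
```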
However, once training actually starts (when training_step begins running), the allocated memory keeps increasing with each iteration. I am still at a loss as to what exactly is causing it. I checked whether there are any appends of non-detached tensors, stopped the logging, and reduced everything else I could, but the issue still persists.
Do you have any advice?
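For what it's worth, the most common cause of this symptom in PyTorch is accumulating loss tensors (e.g. for logging) without detaching them, which keeps every iteration's autograd graph alive. A minimal plain-Python sketch of the pattern — `FakeLoss` and its `graph` field are stand-ins for a tensor and its computation graph, not real PyTorch objects:

```python
class FakeLoss:
    """Stand-in for a loss tensor that holds a reference to its autograd graph."""
    def __init__(self, value, graph):
        self.value = value
        self.graph = graph        # large object that should be freed each step

    def item(self):
        return self.value         # like tensor.item(): returns a plain float

history_leaky, history_ok = [], []
for step in range(3):
    loss = FakeLoss(0.5 * step, graph=bytearray(10_000_000))
    history_leaky.append(loss)        # keeps every step's graph alive -> memory grows
    history_ok.append(loss.item())    # keeps only a plain float -> constant overhead
```

In real code, replacing `history.append(loss)` with `history.append(loss.item())` (or `loss.detach()`) lets PyTorch free each iteration's graph.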
Hi @DavidBoja, any luck finding the reason behind the spike? I am facing the same issue.
No, unfortunately I did not.
I switched to other architectures, since the one from this paper needs a lot of compute power (GPU), and I believe it achieves good results primarily because of the huge amount of data it is trained on.
I wish you luck. Let me know if you manage to find a solution please :).
Hi @DavidBoja, are you working on the 3D human reconstruction problem? Also, which architecture are you using currently?
Hi @mlkorra
I'm more focused on 3D data, rather than 2D data, but I'm interested in guided transformers like these, and non-learning NNs like these.
Not sure of the exact setting you are working with, but one thing we observed that can create GPU memory issues is setting the number of workers higher than what is actually available on the training machine. In that case, decreasing the value avoided the GPU memory increase.
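If it helps anyone reading later, a small sketch of that fix — capping the requested worker count at what the machine actually has. `safe_num_workers` is a hypothetical helper, not part of the repo; it assumes a PyTorch-style `num_workers` argument:

```python
import os

def safe_num_workers(requested):
    """Cap a requested DataLoader worker count at the machine's CPU count."""
    available = os.cpu_count() or 1   # os.cpu_count() can return None
    return max(0, min(requested, available))

# e.g. DataLoader(dataset, num_workers=safe_num_workers(16), ...)
```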
Hi @geopavlakos,
Thanks for the help. I am working with two 12 GB NVIDIA cards. I can run the demo successfully, but I face issues with training. I tried lowering the number of workers and the batch size, and even reduced the SMPL_HEAD depth and number of heads to only 2, but unfortunately the issue still persists.
I don't think the number of workers is related to the issue I'm facing (GPU memory that keeps increasing), because, as I understand it, the workers prepare the dataset examples that will be batched in a training iteration, but those are only transferred to the GPU once the training loop runs.
On the other hand, I have never used PyTorch Lightning before, so maybe issues are arising from there.
In the meantime I have switched to other work, so I'm not actively experimenting with the network; maybe @jerriebright can share more input regarding the issue he is facing, if he is still working on it, or if he has found a solution :).