Comments (9)
When running the training code with 2 GPUs, the following problem occurs:
num classes: 15
2021-11-09 21:32:31 epoch 20/353, processed 291080 samples, lr 0.000333
291144: nGT 155, recall 127, proposals 324, loss: x 4.715607, y 5.864799, w 4.005206, h 3.525136, conf 130.468964, cls 392.400330, class_contrast 1.089527, total 542.069580
291208: nGT 144, recall 129, proposals 356, loss: x 3.931613, y 5.137510, w 6.525736, h 2.192330, conf 89.379707, cls 200.923706, class_contrast 1.220589, total 309.311188
Traceback (most recent call last):
File "tool/train_decoupling_disturbance.py", line 403, in
train(epoch,repeat_time,mask_ratio)
File "tool/train_decoupling_disturbance.py", line 280, in train
output, dynamic_weights = model(data, metax_disturbance, mask_disturbance)
File "/home/dio/VSST/anaconda3/envs/pytorch1.1/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/home/dio/VSST/anaconda3/envs/pytorch1.1/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 152, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/home/dio/VSST/anaconda3/envs/pytorch1.1/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 162, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/home/dio/VSST/anaconda3/envs/pytorch1.1/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 83, in parallel_apply
raise output
File "/home/dio/VSST/anaconda3/envs/pytorch1.1/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 59, in _worker
output = module(*input, **kwargs)
File "/home/dio/VSST/anaconda3/envs/pytorch1.1/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/home/dio/VSST/zwm/YOLO_Meta/CME-main/tool/darknet/darknet_decoupling.py", line 223, in forward
x = self.detect_forward(x, dynamic_weights)
File "/home/dio/VSST/zwm/YOLO_Meta/CME-main/tool/darknet/darknet_decoupling.py", line 175, in detect_forward
x = self.models[ind]((x, dynamic_weights[dynamic_cnt]))
File "/home/dio/VSST/anaconda3/envs/pytorch1.1/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/home/dio/VSST/anaconda3/envs/pytorch1.1/lib/python3.6/site-packages/torch/nn/modules/container.py", line 92, in forward
input = module(input)
File "/home/dio/VSST/anaconda3/envs/pytorch1.1/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/home/dio/VSST/zwm/YOLO_Meta/CME-main/core/dynamic_conv.py", line 163, in forward
self.padding, self.dilation, groups)
RuntimeError: CUDA out of memory. Tried to allocate 422.00 MiB (GPU 0; 10.76 GiB total capacity; 9.72 GiB already allocated; 179.69 MiB free; 84.55 MiB cached)
Do I need to use 4 GPUs to train?
from cme.
In our experiments, we used 2 GPUs for training. You can use more GPUs or reduce the batch size, but that may affect the results.
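If reducing the per-step batch size hurts accuracy, a common generic workaround is gradient accumulation: run several smaller forward/backward passes and call the optimizer once, keeping the effective batch size while lowering peak GPU memory. This is a minimal PyTorch sketch, not code from this repo; the model, data, and batch numbers below are placeholders.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the real model and data; the accumulation
# pattern is the point, not these specific modules.
model = nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.MSELoss()

effective_batch = 64   # batch size the reported results assume
micro_batch = 16       # what actually fits in GPU memory
accum_steps = effective_batch // micro_batch  # micro-batches per optimizer step

data = torch.randn(effective_batch, 10)
target = torch.randn(effective_batch, 2)

optimizer.zero_grad()
for i in range(accum_steps):
    x = data[i * micro_batch:(i + 1) * micro_batch]
    y = target[i * micro_batch:(i + 1) * micro_batch]
    # Scale the loss so accumulated gradients match one big-batch step.
    loss = criterion(model(x), y) / accum_steps
    loss.backward()  # gradients accumulate in each parameter's .grad
optimizer.step()
```

Note this is only numerically equivalent to a large batch for losses that average over the batch; batch-dependent layers such as BatchNorm still see the smaller micro-batch statistics.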
OK, thank you very much!
I used four 1080 Tis and reduced the batch size from 64 to 32 during fine-tuning, and the result is not as good as the one the paper reports. The mAP on VOC split 1 is only 0.385, versus 0.475 in the paper. @Bohao-Lee
I have not tried the reduced batch size setting before, but I can reproduce the performance on two 3080 GPUs.
Thanks, I will try again.
I also encountered this error: RuntimeError: CUDA out of memory. Tried to allocate 422.00 MiB (GPU 0; 10.76 GiB total capacity; 9.72 GiB already allocated; 179.69 MiB free; 84.55 MiB cached). But I only have two 2080 Tis, so what should I do? Reduce the batch size? @xiaofeng-c @Bohao-Lee
Reducing the batch size may help you, but it may also affect performance. @Jxt5671
I use two 3080 GPUs, but I also encountered this error: RuntimeError: CUDA out of memory. Tried to allocate 422.00 MiB. Could you please tell me the CUDA, torch, and Python versions you use?
Besides, I have also tried reducing the batch size to 32. There is no problem with base training, but there is still a CUDA out-of-memory problem in fine-tuning. Should I reduce the batch size to 16? @Bohao-Lee
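When deciding how far the fine-tuning batch size must drop, it can help to log peak GPU memory after each stage instead of guessing. A minimal helper using standard PyTorch CUDA statistics (generic, not part of this repo; note the "cached" counter was renamed "reserved" in newer torch versions):

```python
import torch

def gpu_memory_report(device: int = 0) -> str:
    """Return a one-line summary of peak GPU memory use on `device`,
    or a short note when no CUDA device is present."""
    if not torch.cuda.is_available():
        return "CUDA not available"
    # Peak tensor memory actually allocated since the process started.
    allocated_gib = torch.cuda.max_memory_allocated(device) / 1024 ** 3
    return f"GPU {device}: peak allocated {allocated_gib:.2f} GiB"
```

Calling `gpu_memory_report()` right after base training and again after a few fine-tuning iterations shows which stage is closest to the 10.76 GiB limit, so the batch size only needs to shrink for that stage.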