Comments (15)

kjunelee avatar kjunelee commented on June 16, 2024

To get optimal performance, we need to set episodes_per_batch to 8, which requires 4 GPUs.
In this case,
network = torch.nn.DataParallel(network, device_ids=[0,1,2,3])
python train.py --gpu 0,1,2,3 --save-path "./experiments/miniImageNet_MetaOptNet_SVM" --train-shot 15 --head SVM --network ResNet --dataset miniImageNet --eps 0.1

If you can afford only 1 GPU, you can set episodes_per_batch to 2; then it should work without an OOM error.
In this case,

(You may comment out this line) network = torch.nn.DataParallel(network, device_ids=[0])

python train.py --gpu 0 --save-path "./experiments/miniImageNet_MetaOptNet_SVM" --train-shot 15 --head SVM --network ResNet --dataset miniImageNet --eps 0.1 --episodes-per-batch 2
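
As a rough sketch of the idea above (the helper name wrap_for_gpus and its arguments are illustrative, not the actual code in train.py), the DataParallel wrapping can be made to follow whatever --gpu list is passed:

import torch.nn as nn

def wrap_for_gpus(network, gpu_arg):
    # Hypothetical helper: wrap the embedding network in DataParallel only
    # when more than one GPU is requested via --gpu (e.g. "0,1,2,3").
    device_ids = [int(i) for i in gpu_arg.split(',')]
    network = network.cuda(device_ids[0])
    if len(device_ids) > 1:
        # Each batch of episodes is split across the listed GPUs.
        network = nn.DataParallel(network, device_ids=device_ids)
    return network

# Single-GPU run (pair this with a smaller --episodes-per-batch):
# network = wrap_for_gpus(network, "0")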

Let me know if you have any questions.

ckcraig01 avatar ckcraig01 commented on June 16, 2024

Thanks. Based on your suggestion, I commented out that line
and used the command:
python train.py --gpu 0 --save-path "./experiments/miniImageNet_MetaOptNet_SVM" --train-shot 15 --head SVM --network ResNet --dataset miniImageNet --eps 0.1 --episodes-per-batch 2
but still encountered an OOM condition.

Then I found out I can successfully run with episodes-per-batch 1 or 2 on 1 or 2 GPUs (it could be that I have less memory per GPU)
by setting:
network = torch.nn.DataParallel(network, device_ids=[0,1])
Thanks again for your detailed explanation.

ckcraig01 avatar ckcraig01 commented on June 16, 2024

Hi Kwonjoon,

Sorry, it's me again.

I was training with a single GPU and --episodes-per-batch 1,

and it is strange that I got through the first 1000 batches but hit OOM again at epoch 2.

Please let me know if more information is required. Many thanks.

~/MetaOptNet$ python train.py --gpu 0 --save-path "./experiments/miniImageNet_MetaOptNet_SVM" --train-shot 15 --head SVM --network ResNet --dataset miniImageNet --eps 0.1 --episodes-per-batch 1
Loading mini ImageNet dataset - phase train
Loading mini ImageNet dataset - phase val
('using gpu:', '0')
{'episodes_per_batch': 1, 'head': 'SVM', 'val_query': 15, 'test_way': 5, 'train_way': 5, 'eps': 0.1, 'save_epoch': 10, 'val_episode': 2000, 'num_epoch': 60, 'train_query': 6, 'save_path': './experiments/miniImageNet_MetaOptNet_SVM', 'train_shot': 15, 'val_shot': 5, 'gpu': '0', 'dataset': 'miniImageNet', 'network': 'ResNet'}
Train Epoch: 1 Learning Rate: 0.1000
10%|███████████▍ | 99/1000 [01:04<09:34, 1.57it/s]Train Epoch: 1 Batch: [100/1000] Loss: 1.4024 Accuracy: 46.83 % (56.67 %)
20%|██████████████████████▉ | 199/1000 [02:09<08:43, 1.53it/s]Train Epoch: 1 Batch: [200/1000] Loss: 1.4666 Accuracy: 46.93 % (46.67 %)
30%|██████████████████████████████████▍ | 299/1000 [03:13<07:46, 1.50it/s]Train Epoch: 1 Batch: [300/1000] Loss: 1.4943 Accuracy: 47.28 % (40.00 %)
40%|█████████████████████████████████████████████▉ | 399/1000 [04:22<06:36, 1.52it/s]Train Epoch: 1 Batch: [400/1000] Loss: 1.4684 Accuracy: 47.37 % (43.33 %)
50%|█████████████████████████████████████████████████████████▍ | 499/1000 [05:31<05:45, 1.45it/s]Train Epoch: 1 Batch: [500/1000] Loss: 1.5299 Accuracy: 47.83 % (33.33 %)
60%|████████████████████████████████████████████████████████████████████▉ | 599/1000 [06:39<04:32, 1.47it/s]Train Epoch: 1 Batch: [600/1000] Loss: 1.4971 Accuracy: 48.47 % (33.33 %)
70%|████████████████████████████████████████████████████████████████████████████████▍ | 699/1000 [07:48<03:26, 1.45it/s]Train Epoch: 1 Batch: [700/1000] Loss: 1.3542 Accuracy: 48.91 % (43.33 %)
80%|███████████████████████████████████████████████████████████████████████████████████████████▉ | 799/1000 [08:57<02:16, 1.47it/s]Train Epoch: 1 Batch: [800/1000] Loss: 1.2025 Accuracy: 49.25 % (60.00 %)
90%|███████████████████████████████████████████████████████████████████████████████████████████████████████▍ | 899/1000 [10:06<01:09, 1.45it/s]Train Epoch: 1 Batch: [900/1000] Loss: 1.4882 Accuracy: 49.54 % (33.33 %)
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉| 999/1000 [11:15<00:00, 1.44it/s]Train Epoch: 1 Batch: [1000/1000] Loss: 1.1104 Accuracy: 49.76 % (66.67 %)
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [11:16<00:00, 1.45it/s]
0%| | 1/2000 [00:00<16:29, 2.02it/s]THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1524577523076/work/aten/src/THC/generic/THCStorage.cu line=58 error=2 : out of memory
Exception KeyError: KeyError(<weakref at 0x7fa14aa80e10; to 'tqdm' at 0x7fa126bc1d10>,) in <bound method tqdm.__del__ of 0%| | 1/2000 [00:01<16:29, 2.02it/s]> ignored
Traceback (most recent call last):
File "train.py", line 245, in
emb_query = embedding_net(data_query.reshape([-1] + list(data_query.shape[-3:])))
File "/home/xxx/anaconda3/envs/metaopnet/lib/python2.7/site-packages/torch/nn/modules/module.py", line 491, in call
result = self.forward(*input, **kwargs)
File "/home/xxx/MetaOptNet/models/ResNet12_embedding.py", line 114, in forward
x = self.layer2(x)
File "/home/xxx/anaconda3/envs/metaopnet/lib/python2.7/site-packages/torch/nn/modules/module.py", line 491, in call
result = self.forward(*input, **kwargs)
File "/home/xxx/anaconda3/envs/metaopnet/lib/python2.7/site-packages/torch/nn/modules/container.py", line 91, in forward
input = module(input)
File "/home/xxx/anaconda3/envs/metaopnet/lib/python2.7/site-packages/torch/nn/modules/module.py", line 491, in call
result = self.forward(*input, **kwargs)
File "/home/xxx//MetaOptNet/models/ResNet12_embedding.py", line 56, in forward
out = self.relu(out)
File "/home/xxx/anaconda3/envs/metaopnet/lib/python2.7/site-packages/torch/nn/modules/module.py", line 491, in call
result = self.forward(*input, **kwargs)
File "/home/xxx/anaconda3/envs/metaopnet/lib/python2.7/site-packages/torch/nn/modules/activation.py", line 447, in forward
return F.leaky_relu(input, self.negative_slope, self.inplace)
File "/home/xxx/anaconda3/envs/metaopnet/lib/python2.7/site-packages/torch/nn/functional.py", line 731, in leaky_relu
return torch._C._nn.leaky_relu(input, negative_slope)
RuntimeError: cuda runtime error (2) : out of memory at /opt/conda/conda-bld/pytorch_1524577523076/work/aten/src/THC/generic/THCStorage.cu:58

kjunelee avatar kjunelee commented on June 16, 2024

That is weird.
Can you try --train-shot 5?

ckcraig01 avatar ckcraig01 commented on June 16, 2024

I tried --train-shot 5; the log is below. The message seems different, but the issue might be the same.

(metaopnet) x~/MetaOptNet$ python train.py --gpu 0 --save-path "./experiments/miniImageNet_MetaOptNet_SVM" --train-shot 5 --head SVM --network ResNet --dataset miniImageNet --eps 0.1 --episodes-per-batch 1
Loading mini ImageNet dataset - phase train
Loading mini ImageNet dataset - phase val
('using gpu:', '0')
{'episodes_per_batch': 1, 'head': 'SVM', 'val_query': 15, 'test_way': 5, 'train_way': 5, 'eps': 0.1, 'save_epoch': 10, 'val_episode': 2000, 'num_epoch': 60, 'train_query': 6, 'save_path': './experiments/miniImageNet_MetaOptNet_SVM', 'train_shot': 5, 'val_shot': 5, 'gpu': '0', 'dataset': 'miniImageNet', 'network': 'ResNet'}
Train Epoch: 1 Learning Rate: 0.1000
10%|███████████▍ | 99/1000 [00:28<04:20, 3.46it/s]Train Epoch: 1 Batch: [100/1000] Loss: 1.5324 Accuracy: 39.87 % (36.67 %)
20%|██████████████████████▉ | 199/1000 [00:57<03:59, 3.34it/s]Train Epoch: 1 Batch: [200/1000] Loss: 1.4797 Accuracy: 39.05 % (43.33 %)
30%|██████████████████████████████████▍ | 299/1000 [01:26<03:25, 3.41it/s]Train Epoch: 1 Batch: [300/1000] Loss: 1.3376 Accuracy: 39.41 % (53.33 %)
40%|█████████████████████████████████████████████▉ | 399/1000 [01:55<02:52, 3.48it/s]Train Epoch: 1 Batch: [400/1000] Loss: 1.3189 Accuracy: 39.65 % (53.33 %)
50%|█████████████████████████████████████████████████████████▍ | 499/1000 [02:24<02:25, 3.43it/s]Train Epoch: 1 Batch: [500/1000] Loss: 1.5739 Accuracy: 39.84 % (43.33 %)
60%|████████████████████████████████████████████████████████████████████▉ | 599/1000 [02:54<01:57, 3.41it/s]Train Epoch: 1 Batch: [600/1000] Loss: 1.2768 Accuracy: 40.34 % (53.33 %)
70%|████████████████████████████████████████████████████████████████████████████████▍ | 699/1000 [03:24<01:31, 3.29it/s]Train Epoch: 1 Batch: [700/1000] Loss: 1.6253 Accuracy: 40.80 % (30.00 %)
80%|███████████████████████████████████████████████████████████████████████████████████████████▉ | 799/1000 [03:55<01:01, 3.29it/s]Train Epoch: 1 Batch: [800/1000] Loss: 1.3110 Accuracy: 41.17 % (53.33 %)
90%|███████████████████████████████████████████████████████████████████████████████████████████████████████▍ | 899/1000 [04:26<00:31, 3.26it/s]Train Epoch: 1 Batch: [900/1000] Loss: 1.3345 Accuracy: 41.54 % (56.67 %)
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉| 999/1000 [04:57<00:00, 3.16it/s]Train Epoch: 1 Batch: [1000/1000] Loss: 1.2660 Accuracy: 41.81 % (56.67 %)
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [04:57<00:00, 3.20it/s]
0%| | 1/2000 [00:00<09:57, 3.35it/s]THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1524577523076/work/aten/src/THC/generic/THCStorage.cu line=58 error=2 : out of memory
Exception KeyError: KeyError(<weakref at 0x7f60e517ae10; to 'tqdm' at 0x7f60c12bbd10>,) in <bound method tqdm.__del__ of 0%| | 1/2000 [00:00<09:57, 3.35it/s]> ignored
Traceback (most recent call last):
File "train.py", line 245, in
emb_query = embedding_net(data_query.reshape([-1] + list(data_query.shape[-3:])))
File "/home/xxx/anaconda3/envs/metaopnet/lib/python2.7/site-packages/torch/nn/modules/module.py", line 491, in call
result = self.forward(*input, **kwargs)
File "/home/xxx/MetaOptNet/models/ResNet12_embedding.py", line 114, in forward
x = self.layer2(x)
File "/home/xxx/anaconda3/envs/metaopnet/lib/python2.7/site-packages/torch/nn/modules/module.py", line 491, in call
result = self.forward(*input, **kwargs)
File "/home/xxx/anaconda3/envs/metaopnet/lib/python2.7/site-packages/torch/nn/modules/container.py", line 91, in forward
input = module(input)
File "/home/xxx/anaconda3/envs/metaopnet/lib/python2.7/site-packages/torch/nn/modules/module.py", line 491, in call
result = self.forward(*input, **kwargs)
File "/home/xxx/MetaOptNet/models/ResNet12_embedding.py", line 54, in forward
residual = self.downsample(x)
File "/home/xxx/anaconda3/envs/metaopnet/lib/python2.7/site-packages/torch/nn/modules/module.py", line 491, in call
result = self.forward(*input, **kwargs)
File "/home/xxx/anaconda3/envs/metaopnet/lib/python2.7/site-packages/torch/nn/modules/container.py", line 91, in forward
input = module(input)
File "/home/xxx/anaconda3/envs/metaopnet/lib/python2.7/site-packages/torch/nn/modules/module.py", line 491, in call
result = self.forward(*input, **kwargs)
File "/home/xxx/anaconda3/envs/metaopnet/lib/python2.7/site-packages/torch/nn/modules/conv.py", line 301, in forward
self.padding, self.dilation, self.groups)
RuntimeError: cuda runtime error (2) : out of memory at /opt/conda/conda-bld/pytorch_1524577523076/work/aten/src/THC/generic/THCStorage.cu:58

kjunelee avatar kjunelee commented on June 16, 2024

Hmm. This never happened to me. Can you try --head ProtoNet?

ckcraig01 avatar ckcraig01 commented on June 16, 2024

Still have the same problem with this:
python train.py --gpu 0 --save-path "./experiments/miniImageNet_MetaOptNet_SVM" --train-shot 5 --head SVM --network ResNet --dataset miniImageNet --eps 0.1 --episodes-per-batch 1 --head ProtoNet

May I know how much memory it needs? (Mine is 8 GB per GPU.) It is also weird that the OOM happened at the 2nd epoch.
Additionally, I found that only 2 GB of the 8 GB of memory is used per GPU, even when I use 2 GPUs and set --episodes-per-batch to 1.
In the meantime, this works:
python train.py --gpu 0,1 --save-path "./experiments/miniImageNet_MetaOptNet_SVM" --train-shot 5 --head SVM --network ResNet --dataset miniImageNet --eps 0.1 --episodes-per-batch 1
Is there any memory utilization upper bound setting in the codebase?

I have surveyed some links for your reference. This issue,
NVIDIA/FastPhotoStyle#11
mentions: "@z1412247644 Thanks for your interests. It looks like a GPU memory problem. Would you try to resize your input image first (smaller), in order to fit your GPU memory?"

Some related topics
NVIDIA/FastPhotoStyle#27
NVIDIA/FastPhotoStyle#28
pytorch/pytorch#958

kjunelee avatar kjunelee commented on June 16, 2024

The code was tested on Titan X GPUs, which have 12 GB of RAM.

ckcraig01 avatar ckcraig01 commented on June 16, 2024

I see. But per my previous observation, for batch = 1 and 2 GPUs, the utilization is 2 GB per GPU, so a total of 4 GB of memory is used.
I would like to know: when you run the training process (8 episodes per batch, roughly 2 per GPU), what is the memory utilization per GPU?
I just wonder if there is a setting for a memory utilization upper bound.

kjunelee avatar kjunelee commented on June 16, 2024

In my case, when 8 episodes per batch are used, the utilization gets close to 12 GB per GPU.
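
As far as I know there is no memory cap in the code or in PyTorch itself; the caching allocator just grows with whatever the batch needs. If you want to check the utilization on your side, you can watch nvidia-smi, or query PyTorch's per-device counters with something like the rough sketch below (log_gpu_memory is just an illustrative helper, not part of train.py):

import torch

def log_gpu_memory(tag=""):
    # Print current and peak CUDA memory usage for each visible GPU, in MB.
    for d in range(torch.cuda.device_count()):
        current = torch.cuda.memory_allocated(d) / 1024.0 ** 2
        peak = torch.cuda.max_memory_allocated(d) / 1024.0 ** 2
        print("[{}] GPU {}: {:.0f} MB allocated, {:.0f} MB peak".format(tag, d, current, peak))

# e.g. call log_gpu_memory("after epoch 1") once per epoch while training.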

MAT0RIX avatar MAT0RIX commented on June 16, 2024

Problem solved.

harry2636 avatar harry2636 commented on June 16, 2024

I ran into the same issue mentioned by @ckcraig01 (OOM in the 2nd epoch).

I solved it by upgrading PyTorch to the latest version.

After I upgraded my PyTorch version from 0.4 to 1.1.0, there was no OOM anymore.
I think there are some memory optimization issues in older PyTorch versions.
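
If upgrading is not an option, one common mitigation on 0.4-era code is to run the validation forward passes under torch.no_grad(), so autograd does not keep intermediate activations around. I have not verified that this is the actual cause in train.py; the self-contained sketch below just illustrates the pattern with stand-in tensors named after the traceback:

import torch
import torch.nn as nn

# Stand-ins for the real embedding_net and data_query from train.py
# (shapes roughly follow the 84x84 mini-ImageNet query images).
embedding_net = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU())
data_query = torch.randn(2, 30, 3, 84, 84)  # (episodes_per_batch, n_query, C, H, W)

# no_grad() keeps autograd from storing intermediate activations during
# evaluation, which otherwise inflate GPU memory.
with torch.no_grad():
    emb_query = embedding_net(data_query.reshape([-1] + list(data_query.shape[-3:])))
print(emb_query.shape)  # torch.Size([60, 64, 84, 84])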

kjunelee avatar kjunelee commented on June 16, 2024

Thanks for sharing!

WonderSeven avatar WonderSeven commented on June 16, 2024

Hi, I just tested with PyTorch 1.2.0 on a 2080 Ti (11 GB), and I solved the OOM problem by decreasing the batch size to 1 episode per GPU. It seems that the original setting is not suitable for GPUs with less than 12 GB of memory.

kjunelee avatar kjunelee commented on June 16, 2024

Sorry for the inconvenience. This code was tested on Titan X/Xp GPUs. We haven't tried the code on GPUs with 11 GB of memory.
