Comments (15)

kjunelee avatar kjunelee commented on June 16, 2024

To get optimal performance, we need to set episodes_per_batch to 8, which requires 4 GPUs.
In this case,
network = torch.nn.DataParallel(network, device_ids=[0,1,2,3])
python train.py --gpu 0,1,2,3 --save-path "./experiments/miniImageNet_MetaOptNet_SVM" --train-shot 15 --head SVM --network ResNet --dataset miniImageNet --eps 0.1

If you can afford only 1 GPU, you can set episodes_per_batch to 2; then it should work without an OOM error.
In this case,

(You may comment out this line) network = torch.nn.DataParallel(network, device_ids=[0])

python train.py --gpu 0 --save-path "./experiments/miniImageNet_MetaOptNet_SVM" --train-shot 15 --head SVM --network ResNet --dataset miniImageNet --eps 0.1 --episodes-per-batch 2
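
As a rough sketch of the idea above (the helper name wrap_for_gpus and its arguments are illustrative, not the actual code in train.py), the DataParallel wrapping can be made to follow whatever --gpu list is passed:

import torch.nn as nn

def wrap_for_gpus(network, gpu_arg):
    # Hypothetical helper: wrap the embedding network in DataParallel only
    # when more than one GPU is requested via --gpu (e.g. "0,1,2,3").
    device_ids = [int(i) for i in gpu_arg.split(',')]
    network = network.cuda(device_ids[0])
    if len(device_ids) > 1:
        # Each batch of episodes is split across the listed GPUs.
        network = nn.DataParallel(network, device_ids=device_ids)
    return network

# Single-GPU run (pair this with a smaller --episodes-per-batch):
# network = wrap_for_gpus(network, "0")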

Let me know if you have any questions.

ckcraig01 avatar ckcraig01 commented on June 16, 2024

Thanks. Based on your suggestion, I commented out that line
and used the command:
python train.py --gpu 0 --save-path "./experiments/miniImageNet_MetaOptNet_SVM" --train-shot 15 --head SVM --network ResNet --dataset miniImageNet --eps 0.1 --episodes-per-batch 2
but still encountered an OOM condition.

Then I found out I can successfully run with episodes-per-batch 1 or 2 on 1 or 2 GPUs (it could be that I have less memory per GPU)
by setting:
network = torch.nn.DataParallel(network, device_ids=[0,1])
Thanks again for your detailed explanation.

ckcraig01 avatar ckcraig01 commented on June 16, 2024

Hi Kwonjoon,

Sorry, it's me again.

I was training with a single GPU and --episodes-per-batch 1,

and it is strange that I got through the first 1000 batches but hit OOM again at epoch 2.

Please let me know if more information is required. Many thanks.

~/MetaOptNet$ python train.py --gpu 0 --save-path "./experiments/miniImageNet_MetaOptNet_SVM" --train-shot 15 --head SVM --network ResNet --dataset miniImageNet --eps 0.1 --episodes-per-batch 1
Loading mini ImageNet dataset - phase train
Loading mini ImageNet dataset - phase val
('using gpu:', '0')
{'episodes_per_batch': 1, 'head': 'SVM', 'val_query': 15, 'test_way': 5, 'train_way': 5, 'eps': 0.1, 'save_epoch': 10, 'val_episode': 2000, 'num_epoch': 60, 'train_query': 6, 'save_path': './experiments/miniImageNet_MetaOptNet_SVM', 'train_shot': 15, 'val_shot': 5, 'gpu': '0', 'dataset': 'miniImageNet', 'network': 'ResNet'}
Train Epoch: 1 Learning Rate: 0.1000
10%|███████████▍ | 99/1000 [01:04<09:34, 1.57it/s]Train Epoch: 1 Batch: [100/1000] Loss: 1.4024 Accuracy: 46.83 % (56.67 %)
20%|██████████████████████▉ | 199/1000 [02:09<08:43, 1.53it/s]Train Epoch: 1 Batch: [200/1000] Loss: 1.4666 Accuracy: 46.93 % (46.67 %)
30%|██████████████████████████████████▍ | 299/1000 [03:13<07:46, 1.50it/s]Train Epoch: 1 Batch: [300/1000] Loss: 1.4943 Accuracy: 47.28 % (40.00 %)
40%|█████████████████████████████████████████████▉ | 399/1000 [04:22<06:36, 1.52it/s]Train Epoch: 1 Batch: [400/1000] Loss: 1.4684 Accuracy: 47.37 % (43.33 %)
50%|█████████████████████████████████████████████████████████▍ | 499/1000 [05:31<05:45, 1.45it/s]Train Epoch: 1 Batch: [500/1000] Loss: 1.5299 Accuracy: 47.83 % (33.33 %)
60%|████████████████████████████████████████████████████████████████████▉ | 599/1000 [06:39<04:32, 1.47it/s]Train Epoch: 1 Batch: [600/1000] Loss: 1.4971 Accuracy: 48.47 % (33.33 %)
70%|████████████████████████████████████████████████████████████████████████████████▍ | 699/1000 [07:48<03:26, 1.45it/s]Train Epoch: 1 Batch: [700/1000] Loss: 1.3542 Accuracy: 48.91 % (43.33 %)
80%|███████████████████████████████████████████████████████████████████████████████████████████▉ | 799/1000 [08:57<02:16, 1.47it/s]Train Epoch: 1 Batch: [800/1000] Loss: 1.2025 Accuracy: 49.25 % (60.00 %)
90%|███████████████████████████████████████████████████████████████████████████████████████████████████████▍ | 899/1000 [10:06<01:09, 1.45it/s]Train Epoch: 1 Batch: [900/1000] Loss: 1.4882 Accuracy: 49.54 % (33.33 %)
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉| 999/1000 [11:15<00:00, 1.44it/s]Train Epoch: 1 Batch: [1000/1000] Loss: 1.1104 Accuracy: 49.76 % (66.67 %)
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [11:16<00:00, 1.45it/s]
0%| | 1/2000 [00:00<16:29, 2.02it/s]THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1524577523076/work/aten/src/THC/generic/THCStorage.cu line=58 error=2 : out of memory
Exception KeyError: KeyError(<weakref at 0x7fa14aa80e10; to 'tqdm' at 0x7fa126bc1d10>,) in <bound method tqdm.__del__ of 0%| | 1/2000 [00:01<16:29, 2.02it/s]> ignored
Traceback (most recent call last):
File "train.py", line 245, in
emb_query = embedding_net(data_query.reshape([-1] + list(data_query.shape[-3:])))
File "/home/xxx/anaconda3/envs/metaopnet/lib/python2.7/site-packages/torch/nn/modules/module.py", line 491, in call
result = self.forward(*input, **kwargs)
File "/home/xxx/MetaOptNet/models/ResNet12_embedding.py", line 114, in forward
x = self.layer2(x)
File "/home/xxx/anaconda3/envs/metaopnet/lib/python2.7/site-packages/torch/nn/modules/module.py", line 491, in call
result = self.forward(*input, **kwargs)
File "/home/xxx/anaconda3/envs/metaopnet/lib/python2.7/site-packages/torch/nn/modules/container.py", line 91, in forward
input = module(input)
File "/home/xxx/anaconda3/envs/metaopnet/lib/python2.7/site-packages/torch/nn/modules/module.py", line 491, in call
result = self.forward(*input, **kwargs)
File "/home/xxx//MetaOptNet/models/ResNet12_embedding.py", line 56, in forward
out = self.relu(out)
File "/home/xxx/anaconda3/envs/metaopnet/lib/python2.7/site-packages/torch/nn/modules/module.py", line 491, in call
result = self.forward(*input, **kwargs)
File "/home/xxx/anaconda3/envs/metaopnet/lib/python2.7/site-packages/torch/nn/modules/activation.py", line 447, in forward
return F.leaky_relu(input, self.negative_slope, self.inplace)
File "/home/xxx/anaconda3/envs/metaopnet/lib/python2.7/site-packages/torch/nn/functional.py", line 731, in leaky_relu
return torch._C._nn.leaky_relu(input, negative_slope)
RuntimeError: cuda runtime error (2) : out of memory at /opt/conda/conda-bld/pytorch_1524577523076/work/aten/src/THC/generic/THCStorage.cu:58

kjunelee avatar kjunelee commented on June 16, 2024

That is weird.
Can you try --train-shot 5?

ckcraig01 avatar ckcraig01 commented on June 16, 2024

I tried --train-shot 5; the log is below. The message seems different, but the issue might be the same.

(metaopnet) x~/MetaOptNet$ python train.py --gpu 0 --save-path "./experiments/miniImageNet_MetaOptNet_SVM" --train-shot 5 --head SVM --network ResNet --dataset miniImageNet --eps 0.1 --episodes-per-batch 1
Loading mini ImageNet dataset - phase train
Loading mini ImageNet dataset - phase val
('using gpu:', '0')
{'episodes_per_batch': 1, 'head': 'SVM', 'val_query': 15, 'test_way': 5, 'train_way': 5, 'eps': 0.1, 'save_epoch': 10, 'val_episode': 2000, 'num_epoch': 60, 'train_query': 6, 'save_path': './experiments/miniImageNet_MetaOptNet_SVM', 'train_shot': 5, 'val_shot': 5, 'gpu': '0', 'dataset': 'miniImageNet', 'network': 'ResNet'}
Train Epoch: 1 Learning Rate: 0.1000
10%|███████████▍ | 99/1000 [00:28<04:20, 3.46it/s]Train Epoch: 1 Batch: [100/1000] Loss: 1.5324 Accuracy: 39.87 % (36.67 %)
20%|██████████████████████▉ | 199/1000 [00:57<03:59, 3.34it/s]Train Epoch: 1 Batch: [200/1000] Loss: 1.4797 Accuracy: 39.05 % (43.33 %)
30%|██████████████████████████████████▍ | 299/1000 [01:26<03:25, 3.41it/s]Train Epoch: 1 Batch: [300/1000] Loss: 1.3376 Accuracy: 39.41 % (53.33 %)
40%|█████████████████████████████████████████████▉ | 399/1000 [01:55<02:52, 3.48it/s]Train Epoch: 1 Batch: [400/1000] Loss: 1.3189 Accuracy: 39.65 % (53.33 %)
50%|█████████████████████████████████████████████████████████▍ | 499/1000 [02:24<02:25, 3.43it/s]Train Epoch: 1 Batch: [500/1000] Loss: 1.5739 Accuracy: 39.84 % (43.33 %)
60%|████████████████████████████████████████████████████████████████████▉ | 599/1000 [02:54<01:57, 3.41it/s]Train Epoch: 1 Batch: [600/1000] Loss: 1.2768 Accuracy: 40.34 % (53.33 %)
70%|████████████████████████████████████████████████████████████████████████████████▍ | 699/1000 [03:24<01:31, 3.29it/s]Train Epoch: 1 Batch: [700/1000] Loss: 1.6253 Accuracy: 40.80 % (30.00 %)
80%|███████████████████████████████████████████████████████████████████████████████████████████▉ | 799/1000 [03:55<01:01, 3.29it/s]Train Epoch: 1 Batch: [800/1000] Loss: 1.3110 Accuracy: 41.17 % (53.33 %)
90%|███████████████████████████████████████████████████████████████████████████████████████████████████████▍ | 899/1000 [04:26<00:31, 3.26it/s]Train Epoch: 1 Batch: [900/1000] Loss: 1.3345 Accuracy: 41.54 % (56.67 %)
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉| 999/1000 [04:57<00:00, 3.16it/s]Train Epoch: 1 Batch: [1000/1000] Loss: 1.2660 Accuracy: 41.81 % (56.67 %)
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [04:57<00:00, 3.20it/s]
0%| | 1/2000 [00:00<09:57, 3.35it/s]THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1524577523076/work/aten/src/THC/generic/THCStorage.cu line=58 error=2 : out of memory
Exception KeyError: KeyError(<weakref at 0x7f60e517ae10; to 'tqdm' at 0x7f60c12bbd10>,) in <bound method tqdm.__del__ of 0%| | 1/2000 [00:00<09:57, 3.35it/s]> ignored
Traceback (most recent call last):
File "train.py", line 245, in
emb_query = embedding_net(data_query.reshape([-1] + list(data_query.shape[-3:])))
File "/home/xxx/anaconda3/envs/metaopnet/lib/python2.7/site-packages/torch/nn/modules/module.py", line 491, in call
result = self.forward(*input, **kwargs)
File "/home/xxx/MetaOptNet/models/ResNet12_embedding.py", line 114, in forward
x = self.layer2(x)
File "/home/xxx/anaconda3/envs/metaopnet/lib/python2.7/site-packages/torch/nn/modules/module.py", line 491, in call
result = self.forward(*input, **kwargs)
File "/home/xxx/anaconda3/envs/metaopnet/lib/python2.7/site-packages/torch/nn/modules/container.py", line 91, in forward
input = module(input)
File "/home/xxx/anaconda3/envs/metaopnet/lib/python2.7/site-packages/torch/nn/modules/module.py", line 491, in call
result = self.forward(*input, **kwargs)
File "/home/xxx/MetaOptNet/models/ResNet12_embedding.py", line 54, in forward
residual = self.downsample(x)
File "/home/xxx/anaconda3/envs/metaopnet/lib/python2.7/site-packages/torch/nn/modules/module.py", line 491, in call
result = self.forward(*input, **kwargs)
File "/home/xxx/anaconda3/envs/metaopnet/lib/python2.7/site-packages/torch/nn/modules/container.py", line 91, in forward
input = module(input)
File "/home/xxx/anaconda3/envs/metaopnet/lib/python2.7/site-packages/torch/nn/modules/module.py", line 491, in call
result = self.forward(*input, **kwargs)
File "/home/xxx/anaconda3/envs/metaopnet/lib/python2.7/site-packages/torch/nn/modules/conv.py", line 301, in forward
self.padding, self.dilation, self.groups)
RuntimeError: cuda runtime error (2) : out of memory at /opt/conda/conda-bld/pytorch_1524577523076/work/aten/src/THC/generic/THCStorage.cu:58

kjunelee avatar kjunelee commented on June 16, 2024

Hmm. This never happened to me. Can you try --head ProtoNet?

ckcraig01 avatar ckcraig01 commented on June 16, 2024

Still have the same problem with this:
python train.py --gpu 0 --save-path "./experiments/miniImageNet_MetaOptNet_SVM" --train-shot 5 --head SVM --network ResNet --dataset miniImageNet --eps 0.1 --episodes-per-batch 1 --head ProtoNet

May I know how much memory it needs? (Mine is 8 GB per GPU.) It is also weird that the OOM happened at the 2nd epoch.
Additionally, I found that only 2 GB of the 8 GB of memory is used per GPU, even when I use 2 GPUs and set --episodes-per-batch to 1.
In the meantime, this works:
python train.py --gpu 0,1 --save-path "./experiments/miniImageNet_MetaOptNet_SVM" --train-shot 5 --head SVM --network ResNet --dataset miniImageNet --eps 0.1 --episodes-per-batch 1
Is there any memory utilization upper bound setting in the codebase?

I have surveyed some links for your reference. This issue,
NVIDIA/FastPhotoStyle#11
mentions: "@z1412247644 Thanks for your interests. It looks like a GPU memory problem. Would you try to resize your input image first (smaller), in order to fit your GPU memory?"

Some related topics
NVIDIA/FastPhotoStyle#27
NVIDIA/FastPhotoStyle#28
pytorch/pytorch#958

kjunelee avatar kjunelee commented on June 16, 2024

The code was tested on Titan X GPUs, which have 12 GB of RAM.

ckcraig01 avatar ckcraig01 commented on June 16, 2024

I see. But per my previous observation, for batch = 1 and 2 GPUs, the utilization is 2 GB per GPU, so a total of 4 GB of memory is used.
I would like to know: when you run the training process (8 episodes per batch, roughly 2 per GPU), what is the memory utilization per GPU?
I just wonder if there is a setting for a memory utilization upper bound.

kjunelee avatar kjunelee commented on June 16, 2024

In my case, when 8 episodes per batch are used, the utilization gets close to 12 GB per GPU.
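
As far as I know there is no memory cap in the code or in PyTorch itself; the caching allocator just grows with whatever the batch needs. If you want to check the utilization on your side, you can watch nvidia-smi, or query PyTorch's per-device counters with something like the rough sketch below (log_gpu_memory is just an illustrative helper, not part of train.py):

import torch

def log_gpu_memory(tag=""):
    # Print current and peak CUDA memory usage for each visible GPU, in MB.
    for d in range(torch.cuda.device_count()):
        current = torch.cuda.memory_allocated(d) / 1024.0 ** 2
        peak = torch.cuda.max_memory_allocated(d) / 1024.0 ** 2
        print("[{}] GPU {}: {:.0f} MB allocated, {:.0f} MB peak".format(tag, d, current, peak))

# e.g. call log_gpu_memory("after epoch 1") once per epoch while training.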

MAT0RIX avatar MAT0RIX commented on June 16, 2024

Problem solved.

harry2636 avatar harry2636 commented on June 16, 2024

I ran into the same issue mentioned by @ckcraig01 (OOM in the 2nd epoch).

I solved it by upgrading PyTorch to the latest version.

After I upgraded my PyTorch version from 0.4 to 1.1.0, there was no OOM anymore.
I think there are some memory optimization issues in older PyTorch versions.
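
If upgrading is not an option, one common mitigation on 0.4-era code is to run the validation forward passes under torch.no_grad(), so autograd does not keep intermediate activations around. I have not verified that this is the actual cause in train.py; the self-contained sketch below just illustrates the pattern with stand-in tensors named after the traceback:

import torch
import torch.nn as nn

# Stand-ins for the real embedding_net and data_query from train.py
# (shapes roughly follow the 84x84 mini-ImageNet query images).
embedding_net = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU())
data_query = torch.randn(2, 30, 3, 84, 84)  # (episodes_per_batch, n_query, C, H, W)

# no_grad() keeps autograd from storing intermediate activations during
# evaluation, which otherwise inflate GPU memory.
with torch.no_grad():
    emb_query = embedding_net(data_query.reshape([-1] + list(data_query.shape[-3:])))
print(emb_query.shape)  # torch.Size([60, 64, 84, 84])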

kjunelee avatar kjunelee commented on June 16, 2024

Thanks for sharing!

WonderSeven avatar WonderSeven commented on June 16, 2024

Hi, I just tested with PyTorch 1.2.0 on a 2080 Ti (11 GB), and I solved the OOM problem by decreasing the batch size to 1 episode per GPU. It seems that the original setting is not suitable for GPUs with less than 12 GB of memory.

kjunelee avatar kjunelee commented on June 16, 2024

Sorry for the inconvenience. This code was tested on Titan X/Xp GPUs. We haven't tried the code on GPUs with 11 GB of memory.
