Comments (15)
To get optimal performance, we need to set episodes_per_batch to 8, which requires 4 GPUs.
In this case,
network = torch.nn.DataParallel(network, device_ids=[0,1,2,3])
python train.py --gpu 0,1,2,3 --save-path "./experiments/miniImageNet_MetaOptNet_SVM" --train-shot 15 --head SVM --network ResNet --dataset miniImageNet --eps 0.1
If you can afford only 1 GPU, set episodes_per_batch to 2; training should then run without an out-of-memory (OOM) error.
In this case,
(You may comment out this line) network = torch.nn.DataParallel(network, device_ids=[0])
python train.py --gpu 0 --save-path "./experiments/miniImageNet_MetaOptNet_SVM" --train-shot 15 --head SVM --network ResNet --dataset miniImageNet --eps 0.1 --episodes-per-batch 2
Let me know if you have any questions.
from metaoptnet.
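The rule of thumb in the comment above (8 episodes on 4 GPUs, 2 episodes on 1 GPU) works out to about 2 episodes per GPU. A minimal sketch of that arithmetic follows. The helper names parse_gpu_ids and suggest_episodes_per_batch are hypothetical and not part of train.py, and the default of 2 episodes per GPU is only inferred from this thread.

```python
def parse_gpu_ids(gpu_arg):
    """Turn a --gpu style string such as "0,1,2,3" into [0, 1, 2, 3]."""
    return [int(tok) for tok in gpu_arg.split(",") if tok.strip()]


def suggest_episodes_per_batch(num_gpus, episodes_per_gpu=2):
    """Scale the episode count with the GPU count.

    episodes_per_gpu=2 is an assumption inferred from this thread
    (8 episodes on 4 GPUs, 2 on 1 GPU); lower it for smaller cards.
    """
    if num_gpus < 1:
        raise ValueError("need at least one GPU")
    return num_gpus * episodes_per_gpu


device_ids = parse_gpu_ids("0,1,2,3")
episodes = suggest_episodes_per_batch(len(device_ids))
print(device_ids, episodes)
```

The resulting list could be passed as device_ids to torch.nn.DataParallel, and the count to --episodes-per-batch.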
Thanks. Based on your suggestion, I commented out that line
and used this command:
python train.py --gpu 0 --save-path "./experiments/miniImageNet_MetaOptNet_SVM" --train-shot 15 --head SVM --network ResNet --dataset miniImageNet --eps 0.1 --episodes-per-batch 2
but I still hit an OOM error.
Then I found that I can run successfully with 1 or 2 episodes per batch on 1 or 2 GPUs, respectively (it could be that I have less memory per GPU),
by setting:
network = torch.nn.DataParallel(network, device_ids=[0,1])
Thanks again for your detailed explanation.
Hi Kwonjoon:
Sorry it's me again:
I was training with a single GPU and --episodes-per-batch 1.
Strangely, training got through the first 1000 batches but then hit OOM again at epoch 2.
Please let me know if more information is required. Many thanks.
~/MetaOptNet$ python train.py --gpu 0 --save-path "./experiments/miniImageNet_MetaOptNet_SVM" --train-shot 15 --head SVM --network ResNet --dataset miniImageNet --eps 0.1 --episodes-per-batch 1
Loading mini ImageNet dataset - phase train
Loading mini ImageNet dataset - phase val
('using gpu:', '0')
{'episodes_per_batch': 1, 'head': 'SVM', 'val_query': 15, 'test_way': 5, 'train_way': 5, 'eps': 0.1, 'save_epoch': 10, 'val_episode': 2000, 'num_epoch': 60, 'train_query': 6, 'save_path': './experiments/miniImageNet_MetaOptNet_SVM', 'train_shot': 15, 'val_shot': 5, 'gpu': '0', 'dataset': 'miniImageNet', 'network': 'ResNet'}
Train Epoch: 1 Learning Rate: 0.1000
10%|███████████▍ | 99/1000 [01:04<09:34, 1.57it/s]Train Epoch: 1 Batch: [100/1000] Loss: 1.4024 Accuracy: 46.83 % (56.67 %)
20%|██████████████████████▉ | 199/1000 [02:09<08:43, 1.53it/s]Train Epoch: 1 Batch: [200/1000] Loss: 1.4666 Accuracy: 46.93 % (46.67 %)
30%|██████████████████████████████████▍ | 299/1000 [03:13<07:46, 1.50it/s]Train Epoch: 1 Batch: [300/1000] Loss: 1.4943 Accuracy: 47.28 % (40.00 %)
40%|█████████████████████████████████████████████▉ | 399/1000 [04:22<06:36, 1.52it/s]Train Epoch: 1 Batch: [400/1000] Loss: 1.4684 Accuracy: 47.37 % (43.33 %)
50%|█████████████████████████████████████████████████████████▍ | 499/1000 [05:31<05:45, 1.45it/s]Train Epoch: 1 Batch: [500/1000] Loss: 1.5299 Accuracy: 47.83 % (33.33 %)
60%|████████████████████████████████████████████████████████████████████▉ | 599/1000 [06:39<04:32, 1.47it/s]Train Epoch: 1 Batch: [600/1000] Loss: 1.4971 Accuracy: 48.47 % (33.33 %)
70%|████████████████████████████████████████████████████████████████████████████████▍ | 699/1000 [07:48<03:26, 1.45it/s]Train Epoch: 1 Batch: [700/1000] Loss: 1.3542 Accuracy: 48.91 % (43.33 %)
80%|███████████████████████████████████████████████████████████████████████████████████████████▉ | 799/1000 [08:57<02:16, 1.47it/s]Train Epoch: 1 Batch: [800/1000] Loss: 1.2025 Accuracy: 49.25 % (60.00 %)
90%|███████████████████████████████████████████████████████████████████████████████████████████████████████▍ | 899/1000 [10:06<01:09, 1.45it/s]Train Epoch: 1 Batch: [900/1000] Loss: 1.4882 Accuracy: 49.54 % (33.33 %)
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉| 999/1000 [11:15<00:00, 1.44it/s]Train Epoch: 1 Batch: [1000/1000] Loss: 1.1104 Accuracy: 49.76 % (66.67 %)
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [11:16<00:00, 1.45it/s]
0%| | 1/2000 [00:00<16:29, 2.02it/s]THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1524577523076/work/aten/src/THC/generic/THCStorage.cu line=58 error=2 : out of memory
Exception KeyError: KeyError(<weakref at 0x7fa14aa80e10; to 'tqdm' at 0x7fa126bc1d10>,) in <bound method tqdm.__del__ of 0%| | 1/2000 [00:01<16:29, 2.02it/s]> ignored
Traceback (most recent call last):
  File "train.py", line 245, in <module>
    emb_query = embedding_net(data_query.reshape([-1] + list(data_query.shape[-3:])))
  File "/home/xxx/anaconda3/envs/metaopnet/lib/python2.7/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/xxx/MetaOptNet/models/ResNet12_embedding.py", line 114, in forward
    x = self.layer2(x)
  File "/home/xxx/anaconda3/envs/metaopnet/lib/python2.7/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/xxx/anaconda3/envs/metaopnet/lib/python2.7/site-packages/torch/nn/modules/container.py", line 91, in forward
    input = module(input)
  File "/home/xxx/anaconda3/envs/metaopnet/lib/python2.7/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/xxx//MetaOptNet/models/ResNet12_embedding.py", line 56, in forward
    out = self.relu(out)
  File "/home/xxx/anaconda3/envs/metaopnet/lib/python2.7/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/xxx/anaconda3/envs/metaopnet/lib/python2.7/site-packages/torch/nn/modules/activation.py", line 447, in forward
    return F.leaky_relu(input, self.negative_slope, self.inplace)
  File "/home/xxx/anaconda3/envs/metaopnet/lib/python2.7/site-packages/torch/nn/functional.py", line 731, in leaky_relu
    return torch._C._nn.leaky_relu(input, negative_slope)
RuntimeError: cuda runtime error (2) : out of memory at /opt/conda/conda-bld/pytorch_1524577523076/work/aten/src/THC/generic/THCStorage.cu:58
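The failure occurs at step 1 of a 2000-step loop, which matches val_episode = 2000 in the printed options, so the OOM appears to happen during validation after the first training epoch. As a rough comparison of the two phases, the sketch below counts images per episode from the printed options. It assumes the usual few-shot layout of way x shot support images plus way x query query images, and assumes validation uses test_way, val_shot, and val_query. The function name images_per_episode is hypothetical, and the counts say nothing definite about actual memory use.

```python
def images_per_episode(way, shot, query):
    """Return (support, query, total) image counts for one episode."""
    support = way * shot
    queries = way * query
    return support, queries, support + queries


# Values copied from the options printed at the start of the log above.
opts = {"train_way": 5, "train_shot": 15, "train_query": 6,
        "test_way": 5, "val_shot": 5, "val_query": 15}

train = images_per_episode(opts["train_way"], opts["train_shot"], opts["train_query"])
val = images_per_episode(opts["test_way"], opts["val_shot"], opts["val_query"])
print("train:", train)  # (75, 30, 105)
print("val:  ", val)    # (25, 75, 100)
```

Under these assumptions, the call that fails (emb_query = embedding_net(...)) embeds 75 query images per validation episode versus 30 per training episode, which may contribute to the higher peak at that point.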
It is weird.
Can you try --train-shot 5?
I tried --train-shot 5; the log is below. The traceback looks different, but the issue might be the same.
(metaopnet) x~/MetaOptNet$ python train.py --gpu 0 --save-path "./experiments/miniImageNet_MetaOptNet_SVM" --train-shot 5 --head SVM --network ResNet --dataset miniImageNet --eps 0.1 --episodes-per-batch 1
Loading mini ImageNet dataset - phase train
Loading mini ImageNet dataset - phase val
('using gpu:', '0')
{'episodes_per_batch': 1, 'head': 'SVM', 'val_query': 15, 'test_way': 5, 'train_way': 5, 'eps': 0.1, 'save_epoch': 10, 'val_episode': 2000, 'num_epoch': 60, 'train_query': 6, 'save_path': './experiments/miniImageNet_MetaOptNet_SVM', 'train_shot': 5, 'val_shot': 5, 'gpu': '0', 'dataset': 'miniImageNet', 'network': 'ResNet'}
Train Epoch: 1 Learning Rate: 0.1000
10%|███████████▍ | 99/1000 [00:28<04:20, 3.46it/s]Train Epoch: 1 Batch: [100/1000] Loss: 1.5324 Accuracy: 39.87 % (36.67 %)
20%|██████████████████████▉ | 199/1000 [00:57<03:59, 3.34it/s]Train Epoch: 1 Batch: [200/1000] Loss: 1.4797 Accuracy: 39.05 % (43.33 %)
30%|██████████████████████████████████▍ | 299/1000 [01:26<03:25, 3.41it/s]Train Epoch: 1 Batch: [300/1000] Loss: 1.3376 Accuracy: 39.41 % (53.33 %)
40%|█████████████████████████████████████████████▉ | 399/1000 [01:55<02:52, 3.48it/s]Train Epoch: 1 Batch: [400/1000] Loss: 1.3189 Accuracy: 39.65 % (53.33 %)
50%|█████████████████████████████████████████████████████████▍ | 499/1000 [02:24<02:25, 3.43it/s]Train Epoch: 1 Batch: [500/1000] Loss: 1.5739 Accuracy: 39.84 % (43.33 %)
60%|████████████████████████████████████████████████████████████████████▉ | 599/1000 [02:54<01:57, 3.41it/s]Train Epoch: 1 Batch: [600/1000] Loss: 1.2768 Accuracy: 40.34 % (53.33 %)
70%|████████████████████████████████████████████████████████████████████████████████▍ | 699/1000 [03:24<01:31, 3.29it/s]Train Epoch: 1 Batch: [700/1000] Loss: 1.6253 Accuracy: 40.80 % (30.00 %)
80%|███████████████████████████████████████████████████████████████████████████████████████████▉ | 799/1000 [03:55<01:01, 3.29it/s]Train Epoch: 1 Batch: [800/1000] Loss: 1.3110 Accuracy: 41.17 % (53.33 %)
90%|███████████████████████████████████████████████████████████████████████████████████████████████████████▍ | 899/1000 [04:26<00:31, 3.26it/s]Train Epoch: 1 Batch: [900/1000] Loss: 1.3345 Accuracy: 41.54 % (56.67 %)
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉| 999/1000 [04:57<00:00, 3.16it/s]Train Epoch: 1 Batch: [1000/1000] Loss: 1.2660 Accuracy: 41.81 % (56.67 %)
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [04:57<00:00, 3.20it/s]
0%| | 1/2000 [00:00<09:57, 3.35it/s]THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1524577523076/work/aten/src/THC/generic/THCStorage.cu line=58 error=2 : out of memory
Exception KeyError: KeyError(<weakref at 0x7f60e517ae10; to 'tqdm' at 0x7f60c12bbd10>,) in <bound method tqdm.__del__ of 0%| | 1/2000 [00:00<09:57, 3.35it/s]> ignored
Traceback (most recent call last):
  File "train.py", line 245, in <module>
    emb_query = embedding_net(data_query.reshape([-1] + list(data_query.shape[-3:])))
  File "/home/xxx/anaconda3/envs/metaopnet/lib/python2.7/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/xxx/MetaOptNet/models/ResNet12_embedding.py", line 114, in forward
    x = self.layer2(x)
  File "/home/xxx/anaconda3/envs/metaopnet/lib/python2.7/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/xxx/anaconda3/envs/metaopnet/lib/python2.7/site-packages/torch/nn/modules/container.py", line 91, in forward
    input = module(input)
  File "/home/xxx/anaconda3/envs/metaopnet/lib/python2.7/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/xxx/MetaOptNet/models/ResNet12_embedding.py", line 54, in forward
    residual = self.downsample(x)
  File "/home/xxx/anaconda3/envs/metaopnet/lib/python2.7/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/xxx/anaconda3/envs/metaopnet/lib/python2.7/site-packages/torch/nn/modules/container.py", line 91, in forward
    input = module(input)
  File "/home/xxx/anaconda3/envs/metaopnet/lib/python2.7/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/xxx/anaconda3/envs/metaopnet/lib/python2.7/site-packages/torch/nn/modules/conv.py", line 301, in forward
    self.padding, self.dilation, self.groups)
RuntimeError: cuda runtime error (2) : out of memory at /opt/conda/conda-bld/pytorch_1524577523076/work/aten/src/THC/generic/THCStorage.cu:58
Hmm. This never happened to me. Can you try --head ProtoNet?
I still have the same problem with this command:
python train.py --gpu 0 --save-path "./experiments/miniImageNet_MetaOptNet_SVM" --train-shot 5 --head SVM --network ResNet --dataset miniImageNet --eps 0.1 --episodes-per-batch 1 --head ProtoNet
How much memory does training need? Mine is 8 GB per GPU. (It is odd that the OOM happens at the 2nd epoch.)
Additionally, I found that only 2 GB of the 8 GB is used per GPU when I use 2 GPUs and set --episodes-per-batch to 1.
In the meantime, this works:
python train.py --gpu 0,1 --save-path "./experiments/miniImageNet_MetaOptNet_SVM" --train-shot 5 --head SVM --network ResNet --dataset miniImageNet --eps 0.1 --episodes-per-batch 1
Is there any setting in the codebase that caps memory utilization?
I have collected some links for your reference. In this issue,
NVIDIA/FastPhotoStyle#11
one reply says: @z1412247644 Thanks for your interests. It looks like a GPU memory problem. Would you try to resize your input image first (smaller), in order to fit your GPU memory?
Some related issues:
NVIDIA/FastPhotoStyle#27
NVIDIA/FastPhotoStyle#28
pytorch/pytorch#958
The code was tested on Titan X GPUs which have 12GB RAM.
I see. But per my earlier observation, with --episodes-per-batch 1 and 2 GPUs, usage is 2 GB per GPU, so 4 GB in total.
When you run training with 8 episodes per batch (about 2 per GPU), what is the memory usage per GPU?
I am just wondering whether there is a setting that caps memory utilization.
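To compare per-GPU memory figures like the ones quoted above, one option is to read them from nvidia-smi while training runs. The sketch below is an illustration only and not part of the MetaOptNet code: parse_memory_csv and query_gpu_memory are hypothetical helper names, and the sample text is made up to mirror the 2 GB of 8 GB reported in this thread.

```python
import subprocess


def parse_memory_csv(text):
    """Parse lines of "used, total" (MiB) into a list of (used, total) ints."""
    rows = []
    for line in text.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        rows.append((used, total))
    return rows


def query_gpu_memory():
    """Ask nvidia-smi for per-GPU memory; requires an NVIDIA driver."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"])
    return parse_memory_csv(out.decode("utf-8"))


# Made-up sample mirroring two 8 GB cards with about 2 GB in use each.
sample = "2048, 8192\n2048, 8192\n"
for gpu_id, (used, total) in enumerate(parse_memory_csv(sample)):
    print("GPU %d: %d / %d MiB (%.0f%%)" % (gpu_id, used, total, 100.0 * used / total))
```

On a machine with GPUs, calling query_gpu_memory() in a separate shell during the validation phase would show whether usage approaches the card's limit just before the OOM.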
In my case, with 8 episodes per batch, memory usage approaches 12 GB per GPU.
problem solved.
I ran into the same issue mentioned by @ckcraig01 (OOM in the 2nd epoch).
I solved it by upgrading PyTorch to the latest version.
After I upgraded PyTorch from 0.4 to 1.1.0, the OOM no longer occurred.
I suspect there are some memory optimization issues in older PyTorch versions.
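Since the fix reported here depends on the PyTorch version, a small guard at startup can flag an old install. This is a sketch only: version_tuple, is_new_enough, and MIN_VERSION are hypothetical names, and the 1.1.0 threshold is taken from this comment rather than from any documented requirement of the code.

```python
import re


def version_tuple(version):
    """Convert "1.1.0" or "0.4.1.post2" to a tuple of three leading integers."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError("unrecognized version string: %r" % version)
    # A missing patch component is treated as 0.
    return tuple(int(part) for part in match.groups("0"))


MIN_VERSION = (1, 1, 0)  # assumption: the version reported to work above


def is_new_enough(version, minimum=MIN_VERSION):
    """True if the given version string is at least the minimum."""
    return version_tuple(version) >= minimum


for candidate in ("0.4.0", "1.1.0", "1.2.0+cu92"):
    print(candidate, is_new_enough(candidate))
```

In practice the string would come from torch.__version__ rather than from a hard-coded list.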
Thanks for sharing!
Hi, I just tested with PyTorch 1.2.0 on a 2080 Ti (11 GB) and solved the OOM problem by reducing the batch size to 1 episode per GPU. It seems that the original setting is not suitable for GPUs with less than 12 GB of memory.
Sorry for the inconvenience. This code was tested on Titan X/Xp. We haven't tried the code on GPUs with 11GB memory.