fmu2 / pytorch-maml
A PyTorch implementation of Model Agnostic Meta-Learning (MAML) that faithfully reproduces the results from the original paper.
May I ask what format your CUB200 CSV file uses and how to obtain it? Thanks.
Hello, I would like to ask why training never gets off the ground when I use 2 GPUs, while it works fine with a single GPU.
Hi, do you have any MAML checkpoints that reach the 63% accuracy reported in the original paper?
your ResNet18 is wrong.
Hi there,
This is a great MAML repo. The convnet4 cases work well, but I have encountered some problems employing resnet12.
I only changed the encoder in the config from convnet4 to resnet12.
Because resnet12 requires a lot of GPU memory, I temporarily changed n_step in inner_args from 5 to 1, and then encountered the following error:
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:115: operator(): block: [791,0,0], thread: [37,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:115: operator(): block: [791,0,0], thread: [38,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:115: operator(): block: [791,0,0], thread: [39,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:115: operator(): block: [791,0,0], thread: [44,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
......
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:115: operator(): block: [2161,0,0], thread: [24,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:115: operator(): block: [2161,0,0], thread: [25,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:115: operator(): block: [2161,0,0], thread: [26,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:115: operator(): block: [2161,0,0], thread: [27,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
Traceback (most recent call last):
File "train.py", line 266, in <module>
main(config)
File "train.py", line 142, in main
loss.backward()
File "/root/miniconda3/envs/gm/lib/python3.6/site-packages/torch/tensor.py", line 245, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/root/miniconda3/envs/gm/lib/python3.6/site-packages/torch/autograd/__init__.py", line 147, in backward
allow_unreachable=True, accumulate_grad=True) # allow_unreachable flag
RuntimeError: CUDA error: device-side assert triggered
This error usually occurs when a label index falls outside the range of the classifier's outputs, for example when the classifier has fewer outputs than there are classes.
However, the setup should be the same as in the convnet4 case, and I have checked that the output logits have shape [300, 5].
I would appreciate it if you could take a look at the problem.
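One way to narrow this down, sketched under assumptions: asserts of this kind fire when an index handed to a gather/scatter-style kernel lies outside the valid range, so checking the label values on the CPU before the loss is computed can localize the problem. The helper `find_bad_labels` is hypothetical and not part of the repo; it works on plain Python lists (for a tensor, something like `y.tolist()` would supply one).

```python
def find_bad_labels(labels, n_way):
    """Return (position, value) pairs for labels outside [0, n_way)."""
    return [(i, y) for i, y in enumerate(labels) if not 0 <= y < n_way]


# Hypothetical 5-way episode whose labels were not remapped to 0..4.
labels = [0, 1, 2, 3, 4, 17, 2]
bad = find_bad_labels(labels, n_way=5)
print(bad)  # [(5, 17)]
```

Running the failing script with the environment variable `CUDA_LAUNCH_BLOCKING=1`, or on the CPU, usually makes the traceback point at the operation that actually received the bad index.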
When I call the encoder separately my kernel crashes
Hi Fangzhou, thanks for your codebase! With your codebase, I can reproduce the 5-way-1-shot mini-imagenet experiment with convnet4, single GPU training. However, when I switch to multiple GPUs and use the same parameter config file, the performance is not good.
meta-train set: torch.Size([5, 3, 84, 84]) (x800), 64
meta-val set: torch.Size([5, 3, 84, 84]) (x800), 16
num params: 32.9K
epoch 1, meta-train 1.5233|0.3346, meta-val 1.5522|0.3136, 2.8m 2.8m/14.1h
epoch 2, meta-train 1.4704|0.3686, meta-val 1.5350|0.3228, 2.8m 5.6m/14.1h
epoch 3, meta-train 1.4400|0.3909, meta-val 1.5029|0.3471, 2.8m 8.5m/14.1h
epoch 4, meta-train 1.4185|0.4017, meta-val 1.4785|0.3613, 2.8m 11.2m/14.1h
epoch 5, meta-train 1.3943|0.4204, meta-val 1.4737|0.3663, 2.8m 14.0m/14.0h
epoch 6, meta-train 1.3849|0.4223, meta-val 1.4879|0.3593, 2.8m 16.8m/14.0h
epoch 7, meta-train 1.3802|0.4281, meta-val 1.4652|0.3698, 2.7m 19.6m/14.0h
epoch 8, meta-train 1.3552|0.4411, meta-val 1.4479|0.3810, 2.8m 22.4m/14.0h
epoch 9, meta-train 1.3545|0.4401, meta-val 1.4700|0.3695, 2.9m 25.3m/14.0h
epoch 10, meta-train 1.3389|0.4502, meta-val 1.4644|0.3774, 2.8m 28.0m/14.0h
epoch 11, meta-train 1.3404|0.4474, meta-val 1.4527|0.3813, 2.8m 30.8m/14.0h
epoch 12, meta-train 1.3245|0.4556, meta-val 1.4342|0.3890, 2.8m 33.7m/14.0h
epoch 13, meta-train 1.3221|0.4567, meta-val 1.4344|0.3885, 2.8m 36.4m/14.0h
epoch 14, meta-train 1.3186|0.4571, meta-val 1.4400|0.3878, 2.8m 39.2m/14.0h
epoch 15, meta-train 1.3112|0.4636, meta-val 1.4453|0.3817, 2.8m 42.0m/14.0h
epoch 16, meta-train 1.3044|0.4680, meta-val 1.4152|0.4009, 2.8m 44.9m/14.0h
epoch 17, meta-train 1.2959|0.4737, meta-val 1.4172|0.4009, 2.8m 47.7m/14.0h
meta-train set: torch.Size([5, 3, 84, 84]) (x800), 64
meta-val set: torch.Size([5, 3, 84, 84]) (x800), 16
num params: 32.9K
epoch 1, meta-train 1.6346|0.1970, meta-val 1.6220|0.1994, 43.2s 43.2s/3.6h
epoch 2, meta-train 1.6246|0.2025, meta-val 1.6192|0.2012, 35.0s 1.3m/3.3h
epoch 3, meta-train 1.6215|0.1998, meta-val 1.6163|0.2021, 35.0s 1.9m/3.2h
epoch 4, meta-train 1.6185|0.1999, meta-val 1.6178|0.1984, 35.3s 2.5m/3.1h
epoch 5, meta-train 1.6165|0.2016, meta-val 1.6120|0.2018, 35.2s 3.1m/3.1h
epoch 6, meta-train 1.6121|0.2015, meta-val 1.6109|0.1986, 35.3s 3.7m/3.1h
epoch 7, meta-train 1.6100|0.2017, meta-val 1.6099|0.1968, 35.9s 4.3m/3.0h
epoch 8, meta-train 1.6100|0.1983, meta-val 1.6098|0.2012, 35.1s 4.8m/3.0h
epoch 9, meta-train 1.6098|0.2004, meta-val 1.6101|0.2003, 35.2s 5.4m/3.0h
epoch 10, meta-train 1.6097|0.1985, meta-val 1.6096|0.2010, 35.2s 6.0m/3.0h
epoch 11, meta-train 1.6096|0.2018, meta-val 1.6097|0.1988, 35.6s 6.6m/3.0h
epoch 12, meta-train 1.6097|0.2001, meta-val 1.6096|0.2000, 36.0s 7.2m/3.0h
epoch 13, meta-train 1.6095|0.2018, meta-val 1.6096|0.2003, 37.5s 7.8m/3.0h
epoch 14, meta-train 1.6097|0.1997, meta-val 1.6095|0.2007, 38.2s 8.5m/3.0h
epoch 15, meta-train 1.6096|0.1981, meta-val 1.6094|0.2025, 35.8s 9.1m/3.0h
epoch 16, meta-train 1.6095|0.1990, meta-val 1.6095|0.1986, 36.5s 9.7m/3.0h
epoch 17, meta-train 1.6096|0.1991, meta-val 1.6095|0.2006, 36.0s 10.3m/3.0h
How can I get the multi-GPU baseline working? Did you change parameters such as the inner-update lr and the meta lr? I am also wondering why MAML gets stuck at a local minimum in this multi-GPU experiment. Many thanks, and I look forward to your reply!
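A quick arithmetic check may help in reading the log that plateaus near 1.6095|0.20. For a 5-way classifier that predicts a uniform distribution, the cross-entropy is ln 5 and the expected accuracy is 1/5. The sketch below only reproduces that arithmetic; it does not diagnose the cause.

```python
import math

n_way = 5
chance_loss = math.log(n_way)   # cross-entropy of a uniform prediction
chance_acc = 1.0 / n_way        # expected accuracy of random guessing

print(f"{chance_loss:.4f} {chance_acc:.4f}")  # 1.6094 0.2000
```

A loss pinned at ln 5 suggests the model is producing near-uniform predictions, for instance because the parameters are not receiving a useful gradient. That is a hypothesis, not something the logs confirm.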
Hi. Thanks for publishing this. I have always had a really hard time reproducing the results of the MAML paper and I have to get to the bottom of it for my own sanity. You seem to have followed the original repo quite carefully but I noticed you do not use xavier initializers as they do. This must have been deliberate I assume, so I am curious why you did not use them?
Another repo (https://github.com/haebeom-lee/maml) also didn't use xavier.
I am trying to reproduce the results using the PyTorch higher library (https://github.com/facebookresearch/higher), and so far I am getting pretty bad overfitting on mini-ImageNet: 5-way 1-shot training accuracy reaches the mid 50s, but test accuracy stays around 30%. I have also gone through everything in the original repo many times, and it all seems to be correct and to match yours...
Were there any other tricky parts while you were implementing this that you got stuck on?
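To make the initializer question concrete, here is a small sketch comparing the sampling bound of Xavier (Glorot) uniform initialization with the bound that PyTorch's default `Conv2d` initialization works out to. The layer shape (3x3 kernel, 32 input and 32 output channels) is an assumption chosen for illustration, and the helper names are hypothetical. Whether this repo relies on PyTorch's defaults is also an assumption.

```python
import math


def xavier_uniform_bound(fan_in, fan_out):
    # Xavier/Glorot uniform: U(-b, b) with b = sqrt(6 / (fan_in + fan_out))
    return math.sqrt(6.0 / (fan_in + fan_out))


def torch_default_conv_bound(fan_in):
    # kaiming_uniform_ with a = sqrt(5): gain = sqrt(2 / (1 + a^2)) = sqrt(1/3),
    # so bound = gain * sqrt(3 / fan_in) = 1 / sqrt(fan_in)
    return 1.0 / math.sqrt(fan_in)


kernel, c_in, c_out = 3, 32, 32          # assumed layer shape
fan_in = c_in * kernel * kernel          # 288
fan_out = c_out * kernel * kernel        # 288

print(round(xavier_uniform_bound(fan_in, fan_out), 4))  # 0.1021
print(round(torch_default_conv_bound(fan_in), 4))       # 0.0589
```

Under these assumptions the Xavier bound is about sqrt(3) times wider than the default one. Whether that difference explains any accuracy gap is not something this sketch can show.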
When I set
encoder_args:
  bn_args:
    track_running_stats: True
training will not converge?
Did your implementation train from scratch, or did it load pretrained weights as shown below?
import torchvision.models as models
resnet18 = models.resnet18(pretrained=True)
Hi Fangzhou,
Thank you for your excellent work. The codebase is well-organized and easy to follow.
When I tried to train on mini-imagenet using 2 to 8 GPUs with the following command,
python train.py --config=configs/convnet4/mini-imagenet/5_way_1_shot/train_reproduce.yaml --gpu=0,1,2,3
Python keeps reporting the errors shown below:
meta-train set: torch.Size([5, 3, 84, 84]) (x800), 64
meta-val set: torch.Size([5, 3, 84, 84]) (x800), 16
num params: 32.9K
Traceback (most recent call last):
File "train.py", line 265, in <module>
main(config)
File "train.py", line 130, in main
logits = model(x_shot, x_query, y_shot, inner_args, meta_train=True)
File "**/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "**/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 168, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "**/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 178, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "**/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
output.reraise()
File "**/lib/python3.8/site-packages/torch/_utils.py", line 434, in reraise
raise exception
ValueError: Caught ValueError in replica 0 on device 0.
Original Traceback (most recent call last):
File "**/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
output = module(*input, **kwargs)
File "**/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "**/PyTorch-MAML/models/maml.py", line 223, in forward
updated_params = self._adapt(
File "**/PyTorch-MAML/models/maml.py", line 185, in _adapt
params, mom_buffer = self._inner_iter(
File "**/PyTorch-MAML/models/maml.py", line 99, in _inner_iter
grads = autograd.grad(loss, params.values(),
File "**/lib/python3.8/site-packages/torch/autograd/__init__.py", line 234, in grad
return Variable._execution_engine.run_backward(
ValueError: grad requires non-empty inputs.
The code only works when using 1 GPU. With n_episode=4, I assumed it should also work on 2 or 4 GPUs.
Framework Versions:
python: 3.8
pytorch: 1.10.1 py3.8_cuda11.3_cudnn8.2.0_0
Our ultimate goal is to port this repo into our project, but we ran into the same errors there. Any hints or help would be highly appreciated. Thanks!
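As a sanity check on the episode-to-GPU assumption, the sketch below mimics how `torch.chunk` divides a batch dimension; `nn.DataParallel` is documented as splitting its input by chunking in the batch dimension. The function `chunk_sizes` is a hypothetical pure-Python stand-in, not a PyTorch API.

```python
import math


def chunk_sizes(n_items, n_chunks):
    """Sizes produced by splitting n_items the way torch.chunk does."""
    size = math.ceil(n_items / n_chunks)  # every chunk but the last has this size
    sizes = []
    remaining = n_items
    while remaining > 0:
        take = min(size, remaining)
        sizes.append(take)
        remaining -= take
    return sizes


for gpus in (1, 2, 4, 8):
    print(gpus, chunk_sizes(4, gpus))
# 1 [4]
# 2 [2, 2]
# 4 [1, 1, 1, 1]
# 8 [1, 1, 1, 1]
```

So with n_episode=4, each replica on 2 or 4 GPUs would receive at least one episode. Since `autograd.grad` raises "grad requires non-empty inputs" when the collection of inputs is empty, the traceback suggests that `params.values()` is empty inside the replica rather than that a replica received no data. That reading is an inference from the traceback, not a confirmed diagnosis.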
I saw other implementations loading images from the train/val/test folders as the preprocessing step.
https://github.com/yaoyao-liu/mini-imagenet-tools
I am just curious if using pickle to load the data all at once is memory efficient?
https://github.com/fmu2/PyTorch-MAML/blob/master/datasets/mini_imagenet.py#L18-L32
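A back-of-the-envelope estimate may help here. Assuming the meta-train split holds 64 classes with 600 images each (the standard mini-ImageNet split) and that images are kept at 84x84 RGB, the sketch computes the in-memory size as uint8 and as float32. Whether the pickle actually stores images at this resolution and dtype is an assumption.

```python
n_classes, per_class = 64, 600           # standard mini-ImageNet meta-train split
height, width, channels = 84, 84, 3      # assumed stored resolution

n_values = n_classes * per_class * height * width * channels
uint8_mib = n_values / 2**20             # 1 byte per value
float32_mib = 4 * n_values / 2**20       # 4 bytes per value

print(n_values)                 # 812851200
print(round(uint8_mib, 1))      # 775.2
print(round(float32_mib, 1))    # 3100.8
```

Under those assumptions the raw uint8 array fits comfortably in RAM on most machines, while converting everything to float32 up front would roughly quadruple the footprint.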
Hi, I think bringing up the inductive bias of the MAML method is very interesting. How exactly can we get MAML results in the inductive setting using your implementation?
Should we set the episodic variable of the BatchNorm2d class to True (PyTorch-MAML/models/modules.py, lines 96 to 99 in 19246a1)?
Thank you.
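For readers unfamiliar with the distinction, the sketch below contrasts two ways of normalizing a query feature: with statistics computed from the query batch itself (transductive, since each output depends on the other queries) and with statistics taken from the support set only (one possible inductive alternative). It is a toy illustration in plain Python with made-up numbers, and it makes no claim about what the `episodic` flag in `models/modules.py` does.

```python
import statistics


def normalize(x, mean, var, eps=1e-5):
    return (x - mean) / (var + eps) ** 0.5


support = [1.0, 2.0, 3.0]          # made-up support-set activations
query_a = [2.0, 10.0, 3.0]         # same first query, different companions
query_b = [2.0, 2.5, 3.0]


def transductive(queries):
    # Statistics come from the query batch itself.
    m, v = statistics.fmean(queries), statistics.pvariance(queries)
    return [normalize(q, m, v) for q in queries]


def inductive(queries):
    # Statistics come from the support set only.
    m, v = statistics.fmean(support), statistics.pvariance(support)
    return [normalize(q, m, v) for q in queries]


# The first query's output changes with its companions only in the transductive case.
print(transductive(query_a)[0] == transductive(query_b)[0])  # False
print(inductive(query_a)[0] == inductive(query_b)[0])        # True
```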