
pytorch-maml's People

Contributors

fmu2


pytorch-maml's Issues

2 GPU Training

Hello, I would like to ask: why does training never get going when I use 2 GPUs, while it works fine on a single GPU?

resnet12 running error

Hi there,

This is a great MAML repo. The convnet4 cases work well, but I have run into some problems using resnet12.
I only changed the encoder in the config from convnet4 to resnet12.
Since the resnet12 network requires a lot of GPU memory, I temporarily changed n_step in inner_args from 5 to 1. I then encountered the following error:

/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:115: operator(): block: [791,0,0], thread: [37,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:115: operator(): block: [791,0,0], thread: [38,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:115: operator(): block: [791,0,0], thread: [39,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:115: operator(): block: [791,0,0], thread: [44,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
......
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:115: operator(): block: [2161,0,0], thread: [24,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:115: operator(): block: [2161,0,0], thread: [25,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:115: operator(): block: [2161,0,0], thread: [26,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:115: operator(): block: [2161,0,0], thread: [27,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
Traceback (most recent call last):
  File "train.py", line 266, in <module>
    main(config)
  File "train.py", line 142, in main
    loss.backward()
  File "/root/miniconda3/envs/gm/lib/python3.6/site-packages/torch/tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/root/miniconda3/envs/gm/lib/python3.6/site-packages/torch/autograd/__init__.py", line 147, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: CUDA error: device-side assert triggered

This error usually occurs when the number of classes is set smaller than the size of the classifier layer.
However, the setup should be consistent with the convnet4 case, and I have checked that the output logits have shape [300, 5].
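
For reference, this mismatch is easy to check right before the loss is computed; running with CUDA_LAUNCH_BLOCKING=1 also makes the failing op show up in the Python traceback. A minimal sketch (the names logits and y are placeholders, not the repo's actual variables):

import torch
import torch.nn.functional as F

def checked_cross_entropy(logits, y):
    # logits: [N, n_way] class scores, y: [N] integer labels
    n_way = logits.size(-1)
    assert 0 <= y.min().item() and y.max().item() < n_way, (
        f"labels must lie in [0, {n_way - 1}], "
        f"got min={y.min().item()}, max={y.max().item()}")
    return F.cross_entropy(logits, y)

# e.g. 300 query samples in a 5-way episode
loss = checked_cross_entropy(torch.randn(300, 5), torch.randint(0, 5, (300,)))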

I would appreciate it if you could take a look at the problem.

multi-gpu experiments

Hi Fangzhou, thanks for your codebase! With it, I can reproduce the 5-way 1-shot mini-ImageNet experiment with convnet4 and single-GPU training. However, when I switch to multiple GPUs with the same parameter config file, the performance is poor.

  • single GPU
meta-train set: torch.Size([5, 3, 84, 84]) (x800), 64
meta-val set: torch.Size([5, 3, 84, 84]) (x800), 16
num params: 32.9K
epoch 1, meta-train 1.5233|0.3346, meta-val 1.5522|0.3136, 2.8m 2.8m/14.1h
epoch 2, meta-train 1.4704|0.3686, meta-val 1.5350|0.3228, 2.8m 5.6m/14.1h
epoch 3, meta-train 1.4400|0.3909, meta-val 1.5029|0.3471, 2.8m 8.5m/14.1h
epoch 4, meta-train 1.4185|0.4017, meta-val 1.4785|0.3613, 2.8m 11.2m/14.1h
epoch 5, meta-train 1.3943|0.4204, meta-val 1.4737|0.3663, 2.8m 14.0m/14.0h
epoch 6, meta-train 1.3849|0.4223, meta-val 1.4879|0.3593, 2.8m 16.8m/14.0h
epoch 7, meta-train 1.3802|0.4281, meta-val 1.4652|0.3698, 2.7m 19.6m/14.0h
epoch 8, meta-train 1.3552|0.4411, meta-val 1.4479|0.3810, 2.8m 22.4m/14.0h
epoch 9, meta-train 1.3545|0.4401, meta-val 1.4700|0.3695, 2.9m 25.3m/14.0h
epoch 10, meta-train 1.3389|0.4502, meta-val 1.4644|0.3774, 2.8m 28.0m/14.0h
epoch 11, meta-train 1.3404|0.4474, meta-val 1.4527|0.3813, 2.8m 30.8m/14.0h
epoch 12, meta-train 1.3245|0.4556, meta-val 1.4342|0.3890, 2.8m 33.7m/14.0h
epoch 13, meta-train 1.3221|0.4567, meta-val 1.4344|0.3885, 2.8m 36.4m/14.0h
epoch 14, meta-train 1.3186|0.4571, meta-val 1.4400|0.3878, 2.8m 39.2m/14.0h
epoch 15, meta-train 1.3112|0.4636, meta-val 1.4453|0.3817, 2.8m 42.0m/14.0h
epoch 16, meta-train 1.3044|0.4680, meta-val 1.4152|0.4009, 2.8m 44.9m/14.0h
epoch 17, meta-train 1.2959|0.4737, meta-val 1.4172|0.4009, 2.8m 47.7m/14.0h
  • multi GPU (the performance always looks like the log below and never improves)
meta-train set: torch.Size([5, 3, 84, 84]) (x800), 64
meta-val set: torch.Size([5, 3, 84, 84]) (x800), 16
num params: 32.9K
epoch 1, meta-train 1.6346|0.1970, meta-val 1.6220|0.1994, 43.2s 43.2s/3.6h
epoch 2, meta-train 1.6246|0.2025, meta-val 1.6192|0.2012, 35.0s 1.3m/3.3h
epoch 3, meta-train 1.6215|0.1998, meta-val 1.6163|0.2021, 35.0s 1.9m/3.2h
epoch 4, meta-train 1.6185|0.1999, meta-val 1.6178|0.1984, 35.3s 2.5m/3.1h
epoch 5, meta-train 1.6165|0.2016, meta-val 1.6120|0.2018, 35.2s 3.1m/3.1h
epoch 6, meta-train 1.6121|0.2015, meta-val 1.6109|0.1986, 35.3s 3.7m/3.1h
epoch 7, meta-train 1.6100|0.2017, meta-val 1.6099|0.1968, 35.9s 4.3m/3.0h
epoch 8, meta-train 1.6100|0.1983, meta-val 1.6098|0.2012, 35.1s 4.8m/3.0h
epoch 9, meta-train 1.6098|0.2004, meta-val 1.6101|0.2003, 35.2s 5.4m/3.0h
epoch 10, meta-train 1.6097|0.1985, meta-val 1.6096|0.2010, 35.2s 6.0m/3.0h
epoch 11, meta-train 1.6096|0.2018, meta-val 1.6097|0.1988, 35.6s 6.6m/3.0h
epoch 12, meta-train 1.6097|0.2001, meta-val 1.6096|0.2000, 36.0s 7.2m/3.0h
epoch 13, meta-train 1.6095|0.2018, meta-val 1.6096|0.2003, 37.5s 7.8m/3.0h
epoch 14, meta-train 1.6097|0.1997, meta-val 1.6095|0.2007, 38.2s 8.5m/3.0h
epoch 15, meta-train 1.6096|0.1981, meta-val 1.6094|0.2025, 35.8s 9.1m/3.0h
epoch 16, meta-train 1.6095|0.1990, meta-val 1.6095|0.1986, 36.5s 9.7m/3.0h
epoch 17, meta-train 1.6096|0.1991, meta-val 1.6095|0.2006, 36.0s 10.3m/3.0h

I am wondering how I can get the multi-GPU baseline working. Did you change any parameters, such as the inner-update learning rate or the meta learning rate? I am also wondering why MAML gets stuck at a local minimum in this multi-GPU experiment. Many thanks, and I look forward to your reply!
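
For reference, one difference I can think of is that nn.DataParallel presumably splits the meta-batch of episodes across GPUs, so each replica sees only a fraction of the episodes and any statistics computed across episodes (e.g. batch norm) change relative to the single-GPU run. A rough probe, with illustrative tensor shapes rather than the repo's actual ones:

import torch
import torch.nn as nn

class Probe(nn.Module):
    # prints how many episodes each replica receives; for inspection only
    def forward(self, x_shot):
        print('episodes on this replica:', x_shot.size(0))
        return x_shot

if torch.cuda.device_count() >= 2:
    probe = nn.DataParallel(Probe().cuda())
    x_shot = torch.randn(4, 5, 3, 84, 84, device='cuda')  # 4 episodes, 5 support images each
    probe(x_shot)  # each replica reports roughly 4 / num_gpus episodes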

Why no Xavier initializers? And a few other questions

Hi. Thanks for publishing this. I have always had a really hard time reproducing the results of the MAML paper, and I need to get to the bottom of it for my own sanity. You seem to have followed the original repo quite carefully, but I noticed you do not use Xavier initializers as they do. I assume this was deliberate, so I am curious why you did not use them?
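
For context, the kind of initialization I mean would look roughly like this in PyTorch (generic modules, not this repo's encoder):

import torch.nn as nn

def xavier_init(m):
    # Glorot/Xavier initialization for conv and linear layers, zero biases
    if isinstance(m, (nn.Conv2d, nn.Linear)):
        nn.init.xavier_uniform_(m.weight)
        if m.bias is not None:
            nn.init.zeros_(m.bias)

model = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(), nn.Conv2d(64, 64, 3, padding=1))
model.apply(xavier_init)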

Another repo (https://github.com/haebeom-lee/maml) also does not use Xavier initialization.

I am trying to reproduce the results using the PyTorch higher library (https://github.com/facebookresearch/higher), and so far I am getting pretty bad overfitting on mini-ImageNet (5-way 1-shot training accuracy reaches the mid 50s but stays around 30% on the test set). I have also gone through everything in the original repo many times; it all seems correct and matches your implementation...

Were there any other tricky parts while you were implementing this that you got stuck on?

for resnet12 and resnet18

Did your implementations train from scratch, or did they load pretrained weights as below?

import torchvision.models as models
resnet18 = models.resnet18(pretrained=True)
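
For comparison, the from-scratch alternative with torchvision would be the following (shown only for illustration; whether this repo uses torchvision's resnet at all is exactly what the question asks):

import torchvision.models as models

# random initialization, no ImageNet weights
resnet18_scratch = models.resnet18(weights=None)       # newer torchvision API
# resnet18_scratch = models.resnet18(pretrained=False)  # older torchvision API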

MultiGPU Support (DataParallel)

Hi Fangzhou,

Thank you for your excellent work. The codebase is well-organized and easy to follow.

When I tried to train on mini-ImageNet using anywhere from 2 to 8 GPUs with the following command,

python train.py --config=configs/convnet4/mini-imagenet/5_way_1_shot/train_reproduce.yaml --gpu=0,1,2,3

Python keeps reporting the errors shown below:

meta-train set: torch.Size([5, 3, 84, 84]) (x800), 64
meta-val set: torch.Size([5, 3, 84, 84]) (x800), 16
num params: 32.9K
Traceback (most recent call last):
  File "train.py", line 265, in <module>
    main(config)
  File "train.py", line 130, in main
    logits = model(x_shot, x_query, y_shot, inner_args, meta_train=True)
  File "**/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "**/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 168, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "**/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 178, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "**/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
    output.reraise()
  File "**/lib/python3.8/site-packages/torch/_utils.py", line 434, in reraise
    raise exception
ValueError: Caught ValueError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "**/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "**/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "**/PyTorch-MAML/models/maml.py", line 223, in forward
    updated_params = self._adapt(
  File "**/PyTorch-MAML/models/maml.py", line 185, in _adapt
    params, mom_buffer = self._inner_iter(
  File "**/PyTorch-MAML/models/maml.py", line 99, in _inner_iter
    grads = autograd.grad(loss, params.values(),
  File "**/lib/python3.8/site-packages/torch/autograd/__init__.py", line 234, in grad
    return Variable._execution_engine.run_backward(
ValueError: grad requires non-empty inputs.

However, the code only works with a single GPU. With n_episode=4, I would expect it to also work on 2 or 4 GPUs.
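
One hedged guess at the cause (not verified against this repo): when nn.DataParallel replicates a module, the replicas' parameters become plain tensors in recent PyTorch versions, so named_parameters() on a replica can come back empty; if the inner loop builds its parameter dict from named_parameters(), autograd.grad then receives an empty input list and raises exactly this ValueError. A small check, assuming at least two visible GPUs:

import torch
import torch.nn as nn
from torch.nn.parallel import replicate

if torch.cuda.device_count() >= 2:
    net = nn.Linear(4, 4).cuda()
    replicas = replicate(net, [0, 1])  # what DataParallel does internally per forward pass
    print('original params:', len(dict(net.named_parameters())))          # 2 (weight, bias)
    print('replica params :', len(dict(replicas[0].named_parameters())))  # often 0 on recent PyTorch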

Framework Versions:

  • python: 3.8
  • pytorch: 1.10.1 py3.8_cuda11.3_cudnn8.2.0_0

Our ultimate goal is to adapt this repo for our own project, but we hit the same errors there. Any hints or help would be highly appreciated. Thanks!

MAML Inductive setting

Hi, I think bringing up the inductive setting of MAML is very interesting. How exactly can we obtain the results of MAML in the inductive setting using your implementation?

Should we set the episodic variable of the BatchNorm2d class to True?

class BatchNorm2d(nn.BatchNorm2d, Module):
    def __init__(self, num_features, eps=1e-5, momentum=0.1, affine=True,
                 track_running_stats=True, episodic=False, n_episode=4,
                 alpha=False):

during the meta-testing phase? That way, the normalization statistics would be based on a single task rather than on the whole mini-batch of tasks.

Thank you.
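
For reference, the per-task behaviour I am describing corresponds to normalizing with the current task's batch statistics only; a rough sketch with the standard nn.BatchNorm2d (the repo's BatchNorm2d and its episodic flag may implement this differently):

import torch
import torch.nn as nn

# no running statistics: in train mode, normalization always uses the
# statistics of the batch being processed, i.e. of a single task here
bn = nn.BatchNorm2d(64, track_running_stats=False)
bn.train()

task_feats = torch.randn(25, 64, 10, 10)  # one task's features (illustrative shape)
out = bn(task_feats)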
