
pytorch-maml's People

Contributors

fmu2


pytorch-maml's Issues

2 GPU Training

Hello, I would like to ask: why does training never get going when I use 2 GPUs, while it works fine on a single GPU?

resnet12 running error

Hi there,

This is a great MAML repo. The convnet4 cases work well, but I have run into some problems using resnet12.
I only changed the encoder in the config from convnet4 to resnet12.
Since the resnet12 network requires a lot of GPU memory, I temporarily changed n_step in inner_args from 5 to 1. I then encountered the following error:

/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:115: operator(): block: [791,0,0], thread: [37,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:115: operator(): block: [791,0,0], thread: [38,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:115: operator(): block: [791,0,0], thread: [39,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:115: operator(): block: [791,0,0], thread: [44,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
......
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:115: operator(): block: [2161,0,0], thread: [24,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:115: operator(): block: [2161,0,0], thread: [25,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:115: operator(): block: [2161,0,0], thread: [26,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:115: operator(): block: [2161,0,0], thread: [27,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
Traceback (most recent call last):
  File "train.py", line 266, in <module>
    main(config)
  File "train.py", line 142, in main
    loss.backward()
  File "/root/miniconda3/envs/gm/lib/python3.6/site-packages/torch/tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/root/miniconda3/envs/gm/lib/python3.6/site-packages/torch/autograd/__init__.py", line 147, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: CUDA error: device-side assert triggered

This error usually occurs when the number of classes is set smaller than the size of the classifier layer.
However, the setup should be consistent with the convnet4 case, and I have checked that the output logits have shape [300, 5].
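
For reference, this mismatch is easy to check right before the loss is computed; running with CUDA_LAUNCH_BLOCKING=1 also makes the failing op show up in the Python traceback. A minimal sketch (the names logits and y are placeholders, not the repo's actual variables):

import torch
import torch.nn.functional as F

def checked_cross_entropy(logits, y):
    # logits: [N, n_way] class scores, y: [N] integer labels
    n_way = logits.size(-1)
    assert 0 <= y.min().item() and y.max().item() < n_way, (
        f"labels must lie in [0, {n_way - 1}], "
        f"got min={y.min().item()}, max={y.max().item()}")
    return F.cross_entropy(logits, y)

# e.g. 300 query samples in a 5-way episode
loss = checked_cross_entropy(torch.randn(300, 5), torch.randint(0, 5, (300,)))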

I would appreciate it if you could take a look at the problem.

multi-gpu experiments

Hi Fangzhou, thanks for your codebase! With it, I can reproduce the 5-way 1-shot mini-ImageNet experiment with convnet4 and single-GPU training. However, when I switch to multiple GPUs with the same parameter config file, the performance is poor.

  • single GPU
meta-train set: torch.Size([5, 3, 84, 84]) (x800), 64
meta-val set: torch.Size([5, 3, 84, 84]) (x800), 16
num params: 32.9K
epoch 1, meta-train 1.5233|0.3346, meta-val 1.5522|0.3136, 2.8m 2.8m/14.1h
epoch 2, meta-train 1.4704|0.3686, meta-val 1.5350|0.3228, 2.8m 5.6m/14.1h
epoch 3, meta-train 1.4400|0.3909, meta-val 1.5029|0.3471, 2.8m 8.5m/14.1h
epoch 4, meta-train 1.4185|0.4017, meta-val 1.4785|0.3613, 2.8m 11.2m/14.1h
epoch 5, meta-train 1.3943|0.4204, meta-val 1.4737|0.3663, 2.8m 14.0m/14.0h
epoch 6, meta-train 1.3849|0.4223, meta-val 1.4879|0.3593, 2.8m 16.8m/14.0h
epoch 7, meta-train 1.3802|0.4281, meta-val 1.4652|0.3698, 2.7m 19.6m/14.0h
epoch 8, meta-train 1.3552|0.4411, meta-val 1.4479|0.3810, 2.8m 22.4m/14.0h
epoch 9, meta-train 1.3545|0.4401, meta-val 1.4700|0.3695, 2.9m 25.3m/14.0h
epoch 10, meta-train 1.3389|0.4502, meta-val 1.4644|0.3774, 2.8m 28.0m/14.0h
epoch 11, meta-train 1.3404|0.4474, meta-val 1.4527|0.3813, 2.8m 30.8m/14.0h
epoch 12, meta-train 1.3245|0.4556, meta-val 1.4342|0.3890, 2.8m 33.7m/14.0h
epoch 13, meta-train 1.3221|0.4567, meta-val 1.4344|0.3885, 2.8m 36.4m/14.0h
epoch 14, meta-train 1.3186|0.4571, meta-val 1.4400|0.3878, 2.8m 39.2m/14.0h
epoch 15, meta-train 1.3112|0.4636, meta-val 1.4453|0.3817, 2.8m 42.0m/14.0h
epoch 16, meta-train 1.3044|0.4680, meta-val 1.4152|0.4009, 2.8m 44.9m/14.0h
epoch 17, meta-train 1.2959|0.4737, meta-val 1.4172|0.4009, 2.8m 47.7m/14.0h
  • multi GPU (the performance always looks like the log below and never improves)
meta-train set: torch.Size([5, 3, 84, 84]) (x800), 64
meta-val set: torch.Size([5, 3, 84, 84]) (x800), 16
num params: 32.9K
epoch 1, meta-train 1.6346|0.1970, meta-val 1.6220|0.1994, 43.2s 43.2s/3.6h
epoch 2, meta-train 1.6246|0.2025, meta-val 1.6192|0.2012, 35.0s 1.3m/3.3h
epoch 3, meta-train 1.6215|0.1998, meta-val 1.6163|0.2021, 35.0s 1.9m/3.2h
epoch 4, meta-train 1.6185|0.1999, meta-val 1.6178|0.1984, 35.3s 2.5m/3.1h
epoch 5, meta-train 1.6165|0.2016, meta-val 1.6120|0.2018, 35.2s 3.1m/3.1h
epoch 6, meta-train 1.6121|0.2015, meta-val 1.6109|0.1986, 35.3s 3.7m/3.1h
epoch 7, meta-train 1.6100|0.2017, meta-val 1.6099|0.1968, 35.9s 4.3m/3.0h
epoch 8, meta-train 1.6100|0.1983, meta-val 1.6098|0.2012, 35.1s 4.8m/3.0h
epoch 9, meta-train 1.6098|0.2004, meta-val 1.6101|0.2003, 35.2s 5.4m/3.0h
epoch 10, meta-train 1.6097|0.1985, meta-val 1.6096|0.2010, 35.2s 6.0m/3.0h
epoch 11, meta-train 1.6096|0.2018, meta-val 1.6097|0.1988, 35.6s 6.6m/3.0h
epoch 12, meta-train 1.6097|0.2001, meta-val 1.6096|0.2000, 36.0s 7.2m/3.0h
epoch 13, meta-train 1.6095|0.2018, meta-val 1.6096|0.2003, 37.5s 7.8m/3.0h
epoch 14, meta-train 1.6097|0.1997, meta-val 1.6095|0.2007, 38.2s 8.5m/3.0h
epoch 15, meta-train 1.6096|0.1981, meta-val 1.6094|0.2025, 35.8s 9.1m/3.0h
epoch 16, meta-train 1.6095|0.1990, meta-val 1.6095|0.1986, 36.5s 9.7m/3.0h
epoch 17, meta-train 1.6096|0.1991, meta-val 1.6095|0.2006, 36.0s 10.3m/3.0h

I am wondering how I can get the multi-GPU baseline working. Did you change any parameters, such as the inner-update learning rate or the meta learning rate? I am also wondering why MAML gets stuck at a local minimum in this multi-GPU experiment. Many thanks, and I look forward to your reply!
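
For reference, one difference I can think of is that nn.DataParallel presumably splits the meta-batch of episodes across GPUs, so each replica sees only a fraction of the episodes and any statistics computed across episodes (e.g. batch norm) change relative to the single-GPU run. A rough probe, with illustrative tensor shapes rather than the repo's actual ones:

import torch
import torch.nn as nn

class Probe(nn.Module):
    # prints how many episodes each replica receives; for inspection only
    def forward(self, x_shot):
        print('episodes on this replica:', x_shot.size(0))
        return x_shot

if torch.cuda.device_count() >= 2:
    probe = nn.DataParallel(Probe().cuda())
    x_shot = torch.randn(4, 5, 3, 84, 84, device='cuda')  # 4 episodes, 5 support images each
    probe(x_shot)  # each replica reports roughly 4 / num_gpus episodes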

Why no Xavier initializers? And a few other questions

Hi. Thanks for publishing this. I have always had a really hard time reproducing the results of the MAML paper, and I need to get to the bottom of it for my own sanity. You seem to have followed the original repo quite carefully, but I noticed you do not use Xavier initializers as they do. I assume this was deliberate, so I am curious why you did not use them?
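
For context, the kind of initialization I mean would look roughly like this in PyTorch (generic modules, not this repo's encoder):

import torch.nn as nn

def xavier_init(m):
    # Glorot/Xavier initialization for conv and linear layers, zero biases
    if isinstance(m, (nn.Conv2d, nn.Linear)):
        nn.init.xavier_uniform_(m.weight)
        if m.bias is not None:
            nn.init.zeros_(m.bias)

model = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(), nn.Conv2d(64, 64, 3, padding=1))
model.apply(xavier_init)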

Another repo (https://github.com/haebeom-lee/maml) also does not use Xavier initialization.

I am trying to reproduce the results using the PyTorch higher library (https://github.com/facebookresearch/higher), and so far I am getting pretty bad overfitting on mini-ImageNet (5-way 1-shot training accuracy reaches the mid 50s but stays around 30% on the test set). I have also gone through everything in the original repo many times; it all seems correct and matches your implementation...

Were there any other tricky parts while you were implementing this that you got stuck on?

for resnet12 and resnet18

Did your implementations train from scratch, or did they load pretrained weights as below?

import torchvision.models as models
resnet18 = models.resnet18(pretrained=True)
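
For comparison, the from-scratch alternative with torchvision would be the following (shown only for illustration; whether this repo uses torchvision's resnet at all is exactly what the question asks):

import torchvision.models as models

# random initialization, no ImageNet weights
resnet18_scratch = models.resnet18(weights=None)       # newer torchvision API
# resnet18_scratch = models.resnet18(pretrained=False)  # older torchvision API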

MultiGPU Support (DataParallel)

Hi Fangzhou,

Thank you for your excellent work. The codebase is well-organized and easy to follow.

When I tried to train on mini-ImageNet using anywhere from 2 to 8 GPUs with the following command,

python train.py --config=configs/convnet4/mini-imagenet/5_way_1_shot/train_reproduce.yaml --gpu=0,1,2,3

Python keeps reporting the errors shown below:

meta-train set: torch.Size([5, 3, 84, 84]) (x800), 64
meta-val set: torch.Size([5, 3, 84, 84]) (x800), 16
num params: 32.9K
Traceback (most recent call last):
  File "train.py", line 265, in <module>
    main(config)
  File "train.py", line 130, in main
    logits = model(x_shot, x_query, y_shot, inner_args, meta_train=True)
  File "**/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "**/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 168, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "**/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 178, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "**/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
    output.reraise()
  File "**/lib/python3.8/site-packages/torch/_utils.py", line 434, in reraise
    raise exception
ValueError: Caught ValueError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "**/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "**/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "**/PyTorch-MAML/models/maml.py", line 223, in forward
    updated_params = self._adapt(
  File "**/PyTorch-MAML/models/maml.py", line 185, in _adapt
    params, mom_buffer = self._inner_iter(
  File "**/PyTorch-MAML/models/maml.py", line 99, in _inner_iter
    grads = autograd.grad(loss, params.values(),
  File "**/lib/python3.8/site-packages/torch/autograd/__init__.py", line 234, in grad
    return Variable._execution_engine.run_backward(
ValueError: grad requires non-empty inputs.

However, the code only works with a single GPU. With n_episode=4, I would expect it to also work on 2 or 4 GPUs.
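
One hedged guess at the cause (not verified against this repo): when nn.DataParallel replicates a module, the replicas' parameters become plain tensors in recent PyTorch versions, so named_parameters() on a replica can come back empty; if the inner loop builds its parameter dict from named_parameters(), autograd.grad then receives an empty input list and raises exactly this ValueError. A small check, assuming at least two visible GPUs:

import torch
import torch.nn as nn
from torch.nn.parallel import replicate

if torch.cuda.device_count() >= 2:
    net = nn.Linear(4, 4).cuda()
    replicas = replicate(net, [0, 1])  # what DataParallel does internally per forward pass
    print('original params:', len(dict(net.named_parameters())))          # 2 (weight, bias)
    print('replica params :', len(dict(replicas[0].named_parameters())))  # often 0 on recent PyTorch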

Framework Versions:

  • python: 3.8
  • pytorch: 1.10.1 py3.8_cuda11.3_cudnn8.2.0_0

Our ultimate goal is to adapt this repo for our own project, but we hit the same errors there. Any hints or help would be highly appreciated. Thanks!

MAML Inductive setting

Hi, I think bringing up the inductive setting of MAML is very interesting. How exactly can we obtain the results of MAML in the inductive setting using your implementation?

Should we set the episodic variable of the BatchNorm2d class to True?

class BatchNorm2d(nn.BatchNorm2d, Module):
    def __init__(self, num_features, eps=1e-5, momentum=0.1, affine=True,
                 track_running_stats=True, episodic=False, n_episode=4,
                 alpha=False):

during the meta-testing phase? That way, the normalization statistics would be based on a single task rather than on the whole mini-batch of tasks.

Thank you.
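
For reference, the per-task behaviour I am describing corresponds to normalizing with the current task's batch statistics only; a rough sketch with the standard nn.BatchNorm2d (the repo's BatchNorm2d and its episodic flag may implement this differently):

import torch
import torch.nn as nn

# no running statistics: in train mode, normalization always uses the
# statistics of the batch being processed, i.e. of a single task here
bn = nn.BatchNorm2d(64, track_running_stats=False)
bn.train()

task_feats = torch.randn(25, 64, 10, 10)  # one task's features (illustrative shape)
out = bn(task_feats)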
