
mae's Introduction

Due to limited computational resources, we only test the model on CIFAR-10. We mainly want to reproduce the result that pre-training a ViT with MAE achieves better accuracy than training it from scratch with supervised labels. This should serve as evidence that self-supervised learning is more data-efficient than supervised learning.

We mainly follow the implementation details in the paper. However, due to the differences between CIFAR-10 and ImageNet, we make some modifications:

  • We use ViT-Tiny instead of ViT-Base.
  • Since CIFAR-10 has only 50k training images, we increase the pre-training epochs from 400 to 2000 and the warm-up epochs from 40 to 200 (a sketch of the corresponding warm-up/cosine learning-rate schedule follows this list). We noticed that the loss is still decreasing after 2000 epochs.
  • We decrease the batch size for training the classifier from 1024 to 128 to mitigate overfitting.
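
For reference, mae_pretrain.py drives the optimizer through torch.optim.lr_scheduler.LambdaLR (see the tracebacks in the issues below). Below is a minimal sketch of a warm-up-plus-cosine lr lambda consistent with the epoch counts above; the exact function in the repository may differ:

import math

total_epoch, warmup_epoch = 2000, 200   # the values listed above
# Linear warm-up over the first warmup_epoch epochs, then half-cycle cosine decay.
lr_func = lambda epoch: min(
    (epoch + 1) / (warmup_epoch + 1e-8),
    0.5 * (math.cos(epoch / total_epoch * math.pi) + 1),
)
# used as: torch.optim.lr_scheduler.LambdaLR(optim, lr_lambda=lr_func)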

Installation

pip install -r requirements.txt

Run

# pretrained with mae
python mae_pretrain.py

# train classifier from scratch
python train_classifier.py

# train classifier from pretrained model
python train_classifier.py --pretrained_model_path vit-t-mae.pt --output_model_path vit-t-classifier-from_pretrained.pt

See the training logs with tensorboard --logdir logs.

Result

Model                 Validation Acc (%)
ViT-T w/o pretrain    74.13
ViT-T w/ pretrain     89.77

Weights are available in the GitHub release. You can also view the TensorBoard logs at tensorboard.dev.

Visualization of the first 16 images of the CIFAR-10 validation set:

[reconstruction visualization image]

mae's People

Contributors

icaruswizard


mae's Issues

resource requirement

May I ask which GPU you are using to reproduce this work, and roughly how long a full run takes?

Reconstruction is poor after loading the weights

Hi, when I test with the pre-trained model mae-t-vit.pt you provided and feed it the following image, the reconstruction is poor. What could be the cause?

[input image]

My code is as follows:

import torch
from torchvision import transforms
import cv2
from PIL import Image
import random
import numpy as np

def setup_seed(seed=42):
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    np.random.seed(seed)
    random.seed(seed)
    torch.backends.cudnn.deterministic = True

setup_seed()

# img = cv2.imread('benign_image.png')
img = Image.open('10.png')

trans = transforms.Compose([
    transforms.ToTensor()
])

img = trans(img).unsqueeze(0)  # add a batch dimension: [1, 3, H, W]

model = torch.load('vit-t-mae.pt', map_location='cpu')  # the checkpoint is a fully pickled MAE model
outs = model(img)  # returns (predicted_img, mask)

transforms.ToPILImage()(outs[0].squeeze(0)).save('result.png')

What are the rules for setting the parameters of vit-tiny's decoder?

Thanks for your work! I'm pre-training ViT-Tiny on my own dataset of about 260k images, but I cannot decide on the decoder's parameters (depth / embed_dim / num_heads). Should I keep them consistent with ViT-Base/Large/Huge, or choose smaller values to make a lightweight decoder? Due to GPU limitations, I cannot try many configurations. Could you give me some suggestions? Thanks a lot. :)
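
For reference only, a hypothetical starting point (parameter names and values below are illustrative, not the repository's defaults): the MAE paper's decoder for ViT-Base/Large/Huge uses embed_dim=512, depth=8 and num_heads=16, so for a ViT-Tiny encoder (embed_dim=192) a proportionally lighter decoder is a reasonable first guess.

# Illustrative decoder hyper-parameters for a ViT-Tiny-sized MAE; tune on your data.
decoder_config = dict(
    embed_dim=128,   # no wider than the 192-dim encoder
    depth=4,         # roughly half of the paper's 8 decoder blocks
    num_heads=4,     # keeps the head dimension at 32 (128 / 4)
)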

Unable to reconstruct a distinguishable image

Sorry to bother you. I tried to reconstruct single-channel CT images (512x512), but the MSE loss stopped decreasing and stayed at 0.03 by epoch 1000, and the reconstructed image quality is very poor. What could be the cause of this problem? I'm not familiar with Transformers, so I only tried adjusting the patch size (2 -> 16), the embedding dimension (192 -> 768), and the number of encoder/decoder heads (12).

Fine-tuning and linear evaluation

Thank you so much for this simple yet effective code!

Excuse me because I'm still new to this. My questions are about the train_classifier.py code.
1- Does it do fine-tuning or linear evaluation?
2- Assuming it does one of them, how do I toggle to do the other? What should I change?
3- Does it use the encoder or the decoder to do this?

Excuse my basic questions, but your answer will be really appreciated.
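
For context, the usual distinction is: linear evaluation freezes the pre-trained encoder and trains only the new classification head, while fine-tuning updates all weights end-to-end. A minimal, generic PyTorch sketch of the toggle (the encoder and head here are stand-ins, not the repository's exact modules):

import torch

encoder = torch.nn.Linear(192, 192)   # stand-in for the pre-trained ViT encoder
head = torch.nn.Linear(192, 10)       # new classification head; CIFAR-10 has 10 classes

linear_eval = True                    # toggle between linear evaluation and fine-tuning

if linear_eval:
    for p in encoder.parameters():
        p.requires_grad = False       # freeze the encoder, train only the head
    params = head.parameters()
else:
    params = list(encoder.parameters()) + list(head.parameters())  # train everything

optim = torch.optim.AdamW(params, lr=1e-3)   # learning rate is illustrative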

TypeError: __init__() got an unexpected keyword argument 'verbose'

Hi, I ran into this problem when running python mae_pretrain.py:
Files already downloaded and verified
Files already downloaded and verified
Traceback (most recent call last):
File "mae_pretrain.py", line 44, in
lr_scheduler = torch.optim.lr_scheduler.LambdaLR(optim, lr_lambda=lr_func, verbose=True)
TypeError: __init__() got an unexpected keyword argument 'verbose'
How should I handle this?
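
A hedged workaround: the verbose keyword was only added to PyTorch's LR schedulers in later releases and only controls console logging, so on an older PyTorch the call at line 44 of mae_pretrain.py can simply drop it (or PyTorch can be upgraded):

# replacement for the failing line; behaviour is unchanged apart from the log message
lr_scheduler = torch.optim.lr_scheduler.LambdaLR(optim, lr_lambda=lr_func)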

how to make the masked patches not random

Hi, is there any possible way to fix the positions of the masked patches? (For example, gathering all masked patches into the middle of the image instead of spreading them around the image randomly.)

Thank you for your help!
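
One generic way to do this (a sketch under assumptions, not the repository's PatchShuffle itself) is to replace the random permutation with a fixed one that lists the patches to keep first and the patches to mask, e.g. a centre block, last:

import torch

def center_mask_indexes(grid=16, block=8):
    """Fixed permutation masking a central block x block region of a grid x grid
    patch layout; the 16x16 grid assumes 2x2 patches on 32x32 CIFAR-10 images."""
    idx = torch.arange(grid * grid).reshape(grid, grid)
    lo = (grid - block) // 2
    center = idx[lo:lo + block, lo:lo + block].reshape(-1)   # tokens to mask
    keep = idx.reshape(-1)
    keep = keep[~torch.isin(keep, center)]                   # tokens that stay visible
    forward_indexes = torch.cat([keep, center])              # visible first, masked last
    backward_indexes = torch.argsort(forward_indexes)
    return forward_indexes, backward_indexes

forward_idx, backward_idx = center_mask_indexes()

How these indexes would be wired into PatchShuffle depends on model.py; the idea is simply that whatever ends up after the kept positions is treated as masked.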

Ask some questions

Dear Dr. Zhang, if there are no labels in my training set, can I remove the cls_token? If I do, will it influence the training process?

Custom masking

Hi, thanks for the code.
You answered that we can modify the PatchShuffle class to create custom masks. However, the patch shuffle class takes the output of a Conv2d layer, making it hard to know precisely what part of the image we are masking. Is there any reason for this?

Originally posted by @wenhaowang1995 in #14 (comment)
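
For what it is worth, the Conv2d patchify layer scans the image in row-major order, so each patch token maps back to a fixed pixel region. A small sketch of that mapping, assuming 2x2 patches on 32x32 CIFAR-10 images (any cls_token prepended later shifts the token indices by one):

patch_size = 2                 # assumption: 2x2 patches on 32x32 CIFAR-10 images
grid = 32 // patch_size        # 16 patches per row and per column

def token_to_pixels(token_index: int):
    """Return the (y, x) pixel ranges covered by a patch token, row-major order."""
    row, col = divmod(token_index, grid)
    y0, x0 = row * patch_size, col * patch_size
    return (y0, y0 + patch_size), (x0, x0 + patch_size)

print(token_to_pixels(0))      # ((0, 2), (0, 2))  -- top-left patch
print(token_to_pixels(17))     # ((2, 4), (2, 4))  -- second row, second column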

What are the versions you worked with, keep getting errors

Hi, I'm getting a lot of errors like:
AttributeError: 'Mlp' object has no attribute 'drop1'
or
AttributeError: 'Block' object has no attribute 'drop_path1'
It's probably a compatibility issue between timm and PyTorch; do you have any idea which versions you worked with?
It happens only in train_classifier.py.
Thanks!

Turn off masking

Hi @IcarusWizard,

Great work!!

We are trying to correct segmentation labels using a masked autoencoder. I tried MAE and it is working well.
Now, for inference, I want to pass the whole image (without any masking) and check whether MAE can correct certain regions.
Is there a way to turn off the mask and run inference on the whole image?

Let me know if my question is unclear.

Thanks
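
One hedged way to do this, assuming the mask ratio lives on the encoder's shuffle module (the attribute path below is hypothetical; check model.py for the real names): set the ratio to 0 before inference so no patches are removed.

import torch

model = torch.load('vit-t-mae.pt', map_location='cpu')
model.eval()

# Hypothetical attribute path: if PatchShuffle stores its mask ratio, a ratio of 0
# keeps every patch, so the whole image is passed through the autoencoder unmasked.
model.encoder.shuffle.ratio = 0.0

img = torch.zeros(1, 3, 32, 32)            # replace with a real image tensor
with torch.no_grad():
    predicted_img, mask = model(img)       # mask should now be all zeros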

train_classifier

Hello,
Why don't we drop the decoder and insert a new linear layer after the final norm layer in train_classifier.py for fine-tuning? Why do you take each module of the pretrained model individually and rebuild the fine-tuning model, instead of simply dropping the decoder module?
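
For reference, the recipe the question describes can be sketched as follows. This is a hedged illustration, not the repository's train_classifier.py; the encoder attribute, its (features, backward_indexes) return signature, and the [tokens, batch, dim] token layout are assumptions taken from the tracebacks quoted elsewhere in these issues.

import torch

mae = torch.load('vit-t-mae.pt', map_location='cpu')   # full pre-trained MAE model

class ViTClassifier(torch.nn.Module):
    """Keep only the pre-trained encoder and add a linear head; the decoder is dropped."""
    def __init__(self, encoder, emb_dim=192, num_classes=10):
        super().__init__()
        self.encoder = encoder
        self.head = torch.nn.Linear(emb_dim, num_classes)

    def forward(self, img):
        features, _ = self.encoder(img)   # assumed return: (tokens, backward_indexes)
        cls_token = features[0]           # assumed layout: [tokens, batch, dim], cls token first
        return self.head(cls_token)

classifier = ViTClassifier(mae.encoder)   # mae.encoder is an assumed attribute name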

Suspected minor bug

Hi, there is a minor bug in the MAE_Decoder implementation.
The input to the forward function (line 103 in model.py) has features.shape[0] == 1 + t (where 1 refers to the cls_token, and t the number of unmasked tokens).
Then, in line 117, mask[T:] = 1 should actually be mask[T-1:] = 1, since the mask was created without accounting for the cls_token.

extracting embeddings

Hi, how would you recommend extracting (1D) image embeddings from the model on, say, a held-out data set?
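
A hedged sketch of one option, assuming (as in the tracebacks quoted in other issues here) that the encoder returns a token sequence shaped [tokens, batch, dim] with the cls token first; for deterministic embeddings you would also want masking disabled, as discussed in the "Turn off masking" issue above:

import torch

model = torch.load('vit-t-mae.pt', map_location='cpu')
model.eval()

@torch.no_grad()
def extract_embeddings(img):               # img: [B, 3, 32, 32] tensor from the held-out set
    features, _ = model.encoder(img)       # model.encoder is an assumed attribute name
    cls_emb = features[0]                  # option 1: the cls token, shape [B, emb_dim]
    mean_emb = features[1:].mean(dim=0)    # option 2: mean-pooled patch tokens, shape [B, emb_dim]
    return cls_emb                         # one 1-D vector per image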

Need help reproducing your experiments

Hi !
I installed all the requirements with pip, and when I run the MAE code, the following error comes up:

Traceback (most recent call last):
File "D:\mywork\Projects\PJ1\my_MAE_cifar\main.py", line 87, in <module>
rep, backward_indices = encoder(img0)
File "D:\Python\lib\site-packages\torch\nn\modules\module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "D:\mywork\Projects\PJ1\my_MAE_cifar\model.py", line 80, in forward
features = self.layer_norm(self.transformer(patches))
File "D:\Python\lib\site-packages\torch\nn\modules\module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "D:\Python\lib\site-packages\torch\nn\modules\container.py", line 141, in forward
input = module(input)
File "D:\Python\lib\site-packages\torch\nn\modules\module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "D:\Python\lib\site-packages\timm\models\vision_transformer.py", line 268, in forward
x = x + self.drop_path1(self.ls1(self.attn(self.norm1(x))))
File "D:\Python\lib\site-packages\torch\nn\modules\module.py", line 1185, in getattr
raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'Block' object has no attribute 'drop_path1'. Did you mean: 'drop_path'?

This seems to be a problem coming from the timm package.

However, when I look into the vision_transformer.py file of the timm package, there seems to be no problem with drop_path1 and drop_path2.

Could you please tell me which version of timm you are using? It would be best if you could also provide your torch and CUDA versions.

Expected dtype int64 for index

Sorry, maybe it's a stupid question (I'm new to torch). I hit the issue below when trying the first step; please help. Thanks.

C:\Python\MAE>python mae_pretrain.py
Files already downloaded and verified
Files already downloaded and verified
Adjusting learning rate of group 0 to 1.2000e-05.
0%| | 0/98 [00:00<?, ?it/s]
Traceback (most recent call last):
File "C:\Python\MAE\mae_pretrain.py", line 54, in
predicted_img, mask = model(img)
File "C:\Users\AppData\Roaming\Python\Python39\site-packages\torch\nn\modules\module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "C:\Python\MAE\model.py", line 141, in forward
features, backward_indexes = self.encoder(img)
File "C:\Users\AppData\Roaming\Python\Python39\site-packages\torch\nn\modules\module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "C:\Python\MAE\model.py", line 70, in forward
patches, forward_indexes, backward_indexes = self.shuffle(patches)
File "C:\Users\AppData\Roaming\Python\Python39\site-packages\torch\nn\modules\module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "C:\Python\MAE\model.py", line 33, in forward
patches = take_indexes(patches, forward_indexes)
File "C:\Python\MAE\model.py", line 18, in take_indexes
return torch.gather(sequences, 0, repeat(indexes, 't b -> t b c', c=sequences.shape[-1]))
RuntimeError: gather(): Expected dtype int64 for index

Environment:
Windows 11
C:\Python\MAE>python --version
Python 3.9.9
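
A likely cause (this is a known Windows quirk): NumPy's default integer type is 32-bit on Windows, so index arrays built from NumPy reach torch.gather as int32, while gather requires int64. A hedged fix is to cast the indexes to long before gathering, for example in take_indexes in model.py:

import torch
from einops import repeat

def take_indexes(sequences, indexes):
    # Explicit cast: torch.gather needs an int64 index tensor, but on Windows
    # NumPy-built index arrays often come through as int32.
    indexes = indexes.long()
    return torch.gather(sequences, 0, repeat(indexes, 't b -> t b c', c=sequences.shape[-1]))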

Why are the masks' positions the same in the visualized images?

I ran the pre-training program and looked at the visualized images on TensorBoard. I found that the masks of those 16 images are exactly the same as yours, so I was wondering whether the shuffle is actually random.
The picture with the time on it is my visualization, and the other is yours.
[my visualization image]
[reference visualization image]

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.