facebookresearch / dino
PyTorch code for training Vision Transformers with the self-supervised learning method DINO
License: Apache License 2.0
Hello,
Very interesting work! I was just wondering about the dataset used. Is it the full ImageNet or just the ILSVRC subset?
Thanks!
Tim
First, thanks for your excellent work on self-supervised ViTs. It is very impressive to see the visualizations of the attentions. Just a little piece of advice about the evaluation script on Video Object Segmentation: I noticed that the script computes (1+n_last_frames) source-frame features every time there is a new target frame, which is unnecessary. Instead, we can keep the (1+n_last_frames) source-frame features in a queue and compute only the feature of the target frame (the current frame to propagate), which saves a lot of computation and makes the evaluation much faster; see the sketch below.
Of course this is not the focus of this study. Just a little advice.
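A minimal sketch of the suggested queueing, with hypothetical helpers standing in for the repo's actual evaluation code (extract_feature and the propagation step are stand-ins, not the repo's API):

from collections import deque
import torch

def extract_feature(model, frame):
    # hypothetical stand-in for the per-frame feature extraction
    with torch.no_grad():
        return model(frame.unsqueeze(0))

def propagate_video(model, frames, n_last_frames=7):
    # keep the first frame's feature plus a queue of the n most recent ones,
    # so each step encodes only the current target frame
    first_feat = extract_feature(model, frames[0])
    que = deque(maxlen=n_last_frames)
    for frame in frames[1:]:
        feat = extract_feature(model, frame)
        source_feats = [first_feat] + list(que)  # reused, never recomputed
        # ... label propagation from source_feats to feat goes here ...
        que.append(feat)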
Would it be OK to make a web demo for the attention-map visualization part on https://gradio.app/ under the current license?
Thanks for your wonderful work!
I noticed the amazing transfer learning results in Appendix A.
Can I directly use the code and parameter configuration from DeiT to do fine-tuning and get the 82.8 top-1 performance on ImageNet?
Is it necessary to change the fine-tuning hyper-parameters or the data augmentation? I noticed that DINO and DeiT use different data augmentation methods.
I used the code and hyper-parameters from DeiT to fine-tune DINO-Base/16, but only got 81.34% top-1 accuracy on ImageNet, which is even worse than the from-scratch setting (81.8% top-1).
Hi,
I'm just wondering, is there a way to train this on a single GPU without distributed launch?
Best,
Jason
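If it helps, here is a sketch of one possible route, under the assumption that utils.init_distributed_mode in this repo reads the standard env:// variables (RANK, WORLD_SIZE, LOCAL_RANK), so a single-process run can be launched without torch.distributed.launch:

import os
import subprocess

# emulate a one-process "distributed" run (the port is arbitrary, the data path a placeholder)
env = dict(os.environ, MASTER_ADDR="127.0.0.1", MASTER_PORT="29500",
           RANK="0", WORLD_SIZE="1", LOCAL_RANK="0")
subprocess.run(["python", "main_dino.py",
                "--data_path", "/path/to/imagenet",
                "--output_dir", "./out"], env=env)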
Hi,
I looked at the exported ONNX models with https://netron.app/, but I didn't understand where I can find the "heads from the last layer of a DeiT-S/8 trained with DINO and display the self-attention for [CLS] token query" mentioned in the description of Figure 3, so that I can do the same cool segmentation as shown in the paper.
Regards Armin
#31 Reopen this issue.
b3h
Hi all, I have a problem running the inference script, https://gist.github.com/aquadzn/32ac53aa6e485e7c3e09b1a0914f7422
The images I use are originally grayscale and have been converted to 3 channels. Now the shape of my images is (427, 488, 3).
When I run predict_video(args) I get the following error. Could you please help me here? Thanks.
I converted the images to 3 channels with the following code.
import glob
import cv2
import numpy as np

def saveAsRGB(image_path):
    # replicate the grayscale values into all three channels
    for filepath in sorted(glob.iglob(image_path + "/*")):
        img = cv2.imread(filepath)
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        img2 = np.zeros_like(img)
        img2[:, :, 0] = gray
        img2[:, :, 1] = gray
        img2[:, :, 2] = gray
        cv2.imwrite(filepath, img2)

img = cv2.imread("~input/finalImg_242.png")
print(img.shape)  # output: (427, 488, 3)
Also, as you can see from the code below, inside the predict_video(args) function I printed the shape of the image and confirmed that it has 3 channels ([3, 512, 585]).
import glob
import numpy as np
import cv2
from PIL import Image
import torchvision.transforms as pth_transforms

def predict_video(args):
    for filepath in sorted(glob.iglob(args.image_path + "/*")):
        print("filepath", filepath)
        img = Image.open(filepath)
        img = img.convert('RGB')
        transform = pth_transforms.Compose([
            pth_transforms.ToTensor(),
            pth_transforms.Resize(512),
            pth_transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
        ])
        img = transform(img)
        print("img.shape", img.shape)  # output is torch.Size([3, 512, 585])
Error:
RuntimeError Traceback (most recent call last)
in ()
----> 1 predict_video(args)
6 frames
in predict_video(args)
41
42
---> 43 attentions = model.forward_selfattention(img.cuda())
44
45 nh = attentions.shape[1] # number of head
~/vision_transformer.py in forward_selfattention(self, x)
220 B, nc, w, h = x.shape
221 N = self.pos_embed.shape[1] - 1
--> 222 x = self.patch_embed(x)
223
224 # interpolate patch embeddings
/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
887 result = self._slow_forward(*input, **kwargs)
888 else:
--> 889 result = self.forward(*input, **kwargs)
890 for hook in itertools.chain(
891 _global_forward_hooks.values(),
~/vision_transformer.py in forward(self, x)
116 def forward(self, x):
117 B, C, H, W = x.shape
--> 118 x = self.proj(x).flatten(2).transpose(1, 2)
119 return x
120
/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
887 result = self._slow_forward(*input, **kwargs)
888 else:
--> 889 result = self.forward(*input, **kwargs)
890 for hook in itertools.chain(
891 _global_forward_hooks.values(),
/usr/local/lib/python3.7/dist-packages/torch/nn/modules/conv.py in forward(self, input)
397
398 def forward(self, input: Tensor) -> Tensor:
--> 399 return self._conv_forward(input, self.weight, self.bias)
400
401 class Conv3d(_ConvNd):
/usr/local/lib/python3.7/dist-packages/torch/nn/modules/conv.py in _conv_forward(self, input, weight, bias)
394 _pair(0), self.dilation, self.groups)
395 return F.conv2d(input, weight, bias, self.stride,
--> 396 self.padding, self.dilation, self.groups)
397
398 def forward(self, input: Tensor) -> Tensor:
RuntimeError: Given groups=1, weight of size [384, 3, 8, 8], expected input[1, 0, 512, 585] to have 3 channels, but got 0 channels instead
I found an error that happens in the interpolation method.
It looks like this method only works with square images.
To fix this, we need different scale factors per axis, as in the forward_selfattention method.
I am using the following dataset. It is a subset of ImageNet:
https://github.com/rmccorm4/Tiny-Imagenet-200
After training for one step, the loss becomes NaN and training stops. Have you experienced this problem, and how do you solve it?
Best regards!
Kindly provide the instructions for custom data training and inference.
Hi @mathildecaron31! Thanks for the great package and concise codebase :)
The global crops have resolution 224 x 224 and the local ones 96 x 96 (line 436 in 58aabc0).
Was it important that the local crops are kept at 96 x 96? I'm building an extension, and the code of course becomes simpler if all views are 224 x 224 (since you can collate everything together and don't need separate forward passes per resolution), but I wonder if you think it would impact accuracy (besides slowing computation).
Thanks!
I don't understand which of the two models is later used for inference: is it the student or the teacher? Are the pretrained weights provided from the teacher or the student network?
Hello, thanks for sharing your great work and code!
I was wondering if you experimented with no sharpening and no centering at all? I was thinking that if using either one causes some collapse, why not use neither of them during training.
Thanks,
Jaejin Cho
None of the pre-trained models is available on PyTorch Hub; the code raises an HTTPError: HTTP Error 404: Not Found.
This is excellent work.
About the downstream task of COCO detection:
I want to know whether there are results (mAP) on the COCO dataset.
Thanks.
Hello,
Do you have plans to check the performance of smaller models like mobilenet_v2 or v3 (they are available in torchvision_models)?
If not, I may look into this task. Do you think a small CNN like MobileNet (with width multiplier 1 or even 0.35) is capable of learning these representations?
Hi,
Thanks for sharing your implementation of DINO.
We are trying to figure out how you set the size of the patches when evaluating on DAVIS.
You said in the paper that the input resolution is 480 for DAVIS, and the images are not square.
However, the ViT model seems to be pre-trained with an input resolution of 224 and a patch size of 16, so there is a kind of sequence-length mismatch between pre-training and test time.
Can you give more detail on this?
torch.hub.load('facebookresearch/dino:main', 'dino_resnet50')
Using cache found in /Users/thomas/.cache/torch/hub/facebookresearch_dino_main
Traceback (most recent call last):
File "", line 1, in
File "/Users/thomas/Documents/GitHub/lightning-flash/.venv/lib/python3.7/site-packages/torch/hub.py", line 339, in load
model = _load_local(repo_or_dir, model, *args, **kwargs)
File "/Users/thomas/Documents/GitHub/lightning-flash/.venv/lib/python3.7/site-packages/torch/hub.py", line 368, in _load_local
model = entry(*args, **kwargs)
File "/Users/thomas/.cache/torch/hub/facebookresearch_dino_main/hubconf.py", line 82, in dino_resnet50
model.load_state_dict(state_dict, strict=True)
File "/Users/thomas/Documents/GitHub/lightning-flash/.venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1224, in load_state_dict
self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for ResNet:
Missing key(s) in state_dict: "fc.weight", "fc.bias".
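A possible workaround while the hub entry uses strict=True; this assumes the checkpoint URL below (as listed in the README) serves trunk-only weights, so the classifier head is absent by design:

import torch
import torchvision

model = torchvision.models.resnet50()
state_dict = torch.hub.load_state_dict_from_url(
    "https://dl.fbaipublicfiles.com/dino/dino_resnet50_pretrain/dino_resnet50_pretrain.pth",
    map_location="cpu",
)
msg = model.load_state_dict(state_dict, strict=False)  # tolerate the missing fc.*
print(msg.missing_keys)  # expected: ['fc.weight', 'fc.bias']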
Hi, I am downloading the small model as in the README:
import torch
deits8 = torch.hub.load('facebookresearch/dino:main', 'dino_deits8')
and getting this error
Downloading: "https://github.com/facebookresearch/dino/archive/main.zip" to C:\Users\Igor/.cache\torch\hub\main.zip
Downloading: "https://dl.fbaipublicfiles.com/dino/dino_deitsmall8_pretrain/dino_deitsmall8_pretrain.pth" to C:\Users\Igor/.cache\torch\hub\checkpoints\dino_deitsmall8_pretrain.pth
11%
8.96M/82.7M [00:14<01:56, 663kB/s]
---------------------------------------------------------------------------
ConnectionResetError Traceback (most recent call last)
<ipython-input-22-fae2c58f62a6> in <module>
1 import torch
----> 2 deits8 = torch.hub.load('facebookresearch/dino:main', 'dino_deits8')
~\anaconda3\lib\site-packages\torch\hub.py in load(repo_or_dir, model, *args, **kwargs)
368 repo_or_dir = _get_cache_or_reload(repo_or_dir, force_reload, verbose)
369
--> 370 model = _load_local(repo_or_dir, model, *args, **kwargs)
371 return model
372
~\anaconda3\lib\site-packages\torch\hub.py in _load_local(hubconf_dir, model, *args, **kwargs)
397
398 entry = _load_entry_from_hubconf(hub_module, model)
--> 399 model = entry(*args, **kwargs)
400
401 sys.path.remove(hubconf_dir)
~/.cache\torch\hub\facebookresearch_dino_main\hubconf.py in dino_deits8(pretrained, **kwargs)
30 model = vits.__dict__["deit_small"](patch_size=8, num_classes=0, **kwargs)
31 if pretrained:
---> 32 state_dict = torch.hub.load_state_dict_from_url(
33 url="https://dl.fbaipublicfiles.com/dino/dino_deitsmall8_pretrain/dino_deitsmall8_pretrain.pth",
34 map_location="cpu",
~\anaconda3\lib\site-packages\torch\hub.py in load_state_dict_from_url(url, model_dir, map_location, progress, check_hash, file_name)
553 r = HASH_REGEX.search(filename) # r is Optional[Match[str]]
554 hash_prefix = r.group(1) if r else None
--> 555 download_url_to_file(url, cached_file, hash_prefix, progress=progress)
556
557 if _is_legacy_zip_format(cached_file):
~\anaconda3\lib\site-packages\torch\hub.py in download_url_to_file(url, dst, hash_prefix, progress)
445 unit='B', unit_scale=True, unit_divisor=1024) as pbar:
446 while True:
--> 447 buffer = u.read(8192)
448 if len(buffer) == 0:
449 break
~\anaconda3\lib\http\client.py in read(self, amt)
456 # Amount is given, implement using readinto
457 b = bytearray(amt)
--> 458 n = self.readinto(b)
459 return memoryview(b)[:n].tobytes()
460 else:
~\anaconda3\lib\http\client.py in readinto(self, b)
500 # connection, and the user is reading more bytes than will be provided
501 # (for example, reading in 1k chunks)
--> 502 n = self.fp.readinto(b)
503 if not n and b:
504 # Ideally, we would raise IncompleteRead if the content-length
~\anaconda3\lib\socket.py in readinto(self, b)
667 while True:
668 try:
--> 669 return self._sock.recv_into(b)
670 except timeout:
671 self._timeout_occurred = True
~\anaconda3\lib\ssl.py in recv_into(self, buffer, nbytes, flags)
1239 "non-zero flags not allowed in calls to recv_into() on %s" %
1240 self.__class__)
-> 1241 return self.read(nbytes, buffer)
1242 else:
1243 return super().recv_into(buffer, nbytes, flags)
~\anaconda3\lib\ssl.py in read(self, len, buffer)
1097 try:
1098 if buffer is not None:
-> 1099 return self._sslobj.read(len, buffer)
1100 else:
1101 return self._sslobj.read(len)
ConnectionResetError: [WinError 10054] An existing connection was forcibly closed by the remote host
Any possible solutions? What if I download this model via browser and put it into the torch cache folder; will it work?
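The browser route should work, since torch.hub checks the local checkpoint cache before downloading; a sketch assuming the cache path printed in the log above:

import torch

# after saving dino_deitsmall8_pretrain.pth via a browser to
# C:\Users\Igor\.cache\torch\hub\checkpoints\ ,
# the hub call finds the cached file and skips the download:
deits8 = torch.hub.load('facebookresearch/dino:main', 'dino_deits8')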
As the title implies.
It seems like the script was designed to run only on GPU (line 106 in 6f51bb1).
Wouldn't it be more flexible if it were instead:
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)
Hello,
I can't access the https://dl.fbaipublicfiles.com/dino/dino_resnet50_pretrain/dino_resnet50.onnx link from the README. It would be useful for our research. Any help appreciated.
Can a Colab notebook for inference be added, please?
Dear authors,
thank you very much for this repo. I would like to use the pre-trained weights and fine-tune the network on a different dataset using self-supervised learning with DINO.
However, when I try to use the code, I cannot load the optimizer's state, and I get the following output:
Found checkpoint at ./dino_ft_workdir/checkpoint.pth
=> loaded student from checkpoint './dino_ft_workdir/checkpoint.pth' with msg <All keys matched successfully>
=> loaded teacher from checkpoint './dino_ft_workdir/checkpoint.pth' with msg <All keys matched successfully>
=> loaded optimizer from checkpoint './dino_ft_workdir/checkpoint.pth'
=> failed to load fp16_scaler from checkpoint './dino_ft_workdir/checkpoint.pth'
=> failed to load dino_loss from checkpoint './dino_ft_workdir/checkpoint.pth'
fp16_scaler and dino_loss are not in the checkpoint, so it is clear why they are not loaded.
I found out that the problem with the optimizer is caused by this:
ValueError: loaded state dict contains a parameter group that doesn't match the size of optimizer's group.
Is there anyone who could help me?
Thank you very much in advance!
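A minimal sketch of one way around this, assuming the mismatch comes from restoring an optimizer state that was saved with a different param-group layout: restore only the network weights and let the optimizer, fp16 scaler and schedulers start fresh.

import torch

def load_weights_only(student, teacher, path="./dino_ft_workdir/checkpoint.pth"):
    # restore just the model weights (these matched, per the log above);
    # deliberately skip ckpt["optimizer"], whose param groups no longer
    # match the newly built optimizer
    ckpt = torch.load(path, map_location="cpu")
    student.load_state_dict(ckpt["student"])
    teacher.load_state_dict(ckpt["teacher"])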
Hi, @mathildecaron31. I have a question about a detail in the paper. In Appendix B, Relation to SwAV (Table 14), the paper does an ablative study over the composing parts, in terms of momentum encoder and extra operation. I'm wondering whether the softmax over the channel dimension is also applied in the experiments other than Centering (i.e., experiments 2, 3, 5, 6), since DINO uses an extra channel-wise softmax after centering to compute the loss, while in SwAV the outputs after the batch-level softmax/SK algorithm are used to compute the CE.
For example, lines 55 to 61 in 8aa93fd.
Is that intended (i.e., the model will not work if patch_size=16)? If not, I'd be happy to contribute a PR to fix it.
I ran
deits16 = torch.hub.load('facebookresearch/dino:main', 'dino_deits16')
to retrieve the pretrained model. This takes an image and outputs a vector of length 384. What is this vector? Is it a representation of the image? And if it is, can I use this pretrained network to create representations of images/patches that I can use for clustering? (A sketch of this idea follows below.)
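If it helps: the 384-dim output is the final-layer CLS-token embedding (for ViT-S), so it can serve as an image representation for clustering. A minimal sketch with a random stand-in batch; scikit-learn is an assumption here, not part of this repo:

import torch
from sklearn.cluster import KMeans  # assumption: scikit-learn is installed

model = torch.hub.load('facebookresearch/dino:main', 'dino_deits16')
model.eval()
with torch.no_grad():
    feats = model(torch.randn(8, 3, 224, 224))  # stand-in batch -> [8, 384] CLS features
labels = KMeans(n_clusters=2, n_init=10).fit_predict(feats.numpy())
print(labels)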
Secondly,
in the visualize_attention.py file, lines 199 to 206 save images for all attention heads in a for loop. How are the single images in the paper generated, then? Are they a combination of the heatmaps output by the model? And if so, how are they combined: by averaging or by summation?
Hi, I wonder how to use "python visualize_attention.py" to process all the images in my folder.
I've tried "python visualize_attention.py --output_dir /dino/out --image_path ...", but it can only process one image.
I want to multiply the processed attention maps with the original images to get background-removed data.
If there is any way, please tell me. Thanks!
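One simple way to batch it without touching the script, a sketch assuming the output directory from the question and a placeholder image folder:

import glob
import subprocess

for path in sorted(glob.glob("/path/to/my_folder/*")):  # placeholder folder
    subprocess.run(["python", "visualize_attention.py",
                    "--image_path", path,
                    "--output_dir", "/dino/out"])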
I tried to run eval_linear.py after training DINO on a custom dataset. I get the following error:
Traceback (most recent call last):
File "/home/ubuntu/dino/eval_linear.py", line 250, in
eval_linear(args)
File "/home/ubuntu/dino/eval_linear.py", line 142, in eval_linear
"Top-1 test accuracy: {acc:.1f}".format(acc=max_accuracy))
NameError: name 'max_accuracy' is not defined
Killing subprocess 44659
Traceback (most recent call last):
File "/home/ubuntu/anaconda3/envs/pytorch/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/ubuntu/anaconda3/envs/pytorch/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/ubuntu/anaconda3/envs/pytorch/lib/python3.9/site-packages/torch/distributed/launch.py", line 340, in
main()
File "/home/ubuntu/anaconda3/envs/pytorch/lib/python3.9/site-packages/torch/distributed/launch.py", line 326, in main
sigkill_handler(signal.SIGTERM, None) # not coming back
File "/home/ubuntu/anaconda3/envs/pytorch/lib/python3.9/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/ubuntu/anaconda3/envs/pytorch/bin/python', '-u', 'eval_linear.py', '--local_rank=0', '--data_path', 'images', '--pretrained_weights', 'runs/checkpoint.pth']' returned non-zero exit status 1.
After skimming the lines above it, I believe it should be best_acc rather than max_accuracy?
Note: I have changed the dataset class (PyTorch Dataset) to work with my custom dataset, but I believe the error stands regardless; I searched for max_accuracy in the entire file and found only this one occurrence (i.e., it is not defined earlier).
Line 316 in 534f37f should be:
param_norms = utils.clip_gradients(student, args.clip_grad)
How can I extract intermediate-layer features from the ViT-B/8 model?
I tried:
import torch
import torch.nn as nn
model = torch.hub.load('facebookresearch/dino:main', 'dino_vitb8')
modules = list(model.children())[:-2]
model = nn.Sequential(*modules)
but model(x.to(device)) raises: forward() takes 1 positional argument but 2 were given.
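A ViT forward pass is not a plain nn.Sequential (the CLS token, positional embeddings and residual connections are wired inside forward), so slicing children() breaks it. A forward hook is a safer sketch, assuming the hub model exposes its transformer blocks as model.blocks as in this repo's vision_transformer.py:

import torch

model = torch.hub.load('facebookresearch/dino:main', 'dino_vitb8')
model.eval()

feats = {}
def hook(module, inputs, output):
    feats["last_block"] = output  # token features from the last transformer block

model.blocks[-1].register_forward_hook(hook)
with torch.no_grad():
    _ = model(torch.randn(1, 3, 224, 224))
print(feats["last_block"].shape)  # e.g. [1, 785, 768] for ViT-B/8 at 224x224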
Hi all, I am trying to execute visualize_attention.py with the default pretrained weights on my own image, as below:
!python visualize_attention.py --image_path 'test/finalImg_249.png'
I get a size mismatch error. Could you please let me know what changes need to be made here?
Error stack trace:
Please use the --pretrained_weights
argument to indicate the path of the checkpoint to evaluate.
Since no pretrained weights have been provided, we load the reference pretrained DINO weights.
/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py:3458: UserWarning: Default upsampling behavior when mode=bicubic is changed to align_corners=False since 0.4.0. Please specify align_corners=True if the old behavior is desired. See the documentation of nn.Upsample for details.
"See the documentation of nn.Upsample for details.".format(mode)
/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py:3503: UserWarning: The default behavior for interpolate/upsample with float scale_factor changed in 1.6.0 to align with other frameworks/libraries, and now uses scale_factor directly, instead of relying on the computed output size. If you wish to restore the old behavior, please set recompute_scale_factor=True. See the documentation of nn.Upsample for details.
"The default behavior for interpolate/upsample with float scale_factor changed "
Traceback (most recent call last):
File "visualize_attention.py", line 162, in
attentions = model.forward_selfattention(img.to(device))
File "~/dino/vision_transformer.py", line 246, in forward_selfattention
x = x + pos_embed
RuntimeError: The size of tensor a (3234) must match the size of tensor b (3181) at non-singleton dimension 1
Image details:
import cv2
img = cv2.imread('finalImg_249.png')
print(img.shape)  # output: (427, 488, 3)
Here is a quote from this comment #8 (comment):
As a matter of fact, on copy detection datasets, I've found the base models to perform clearly better than the small ones: I get better performance with Base16x16 than with Small8x8 though Small8x8 is better at k-NN ImNet.
I assume this is about Table 2 from the article.
We see that for both of the tasks (Linear and k-NN):
However, when it comes to comparing the /8 architectures:
What is the rationale behind DeiT-S/8 being better than ViT-B/8 at the k-NN ImNet task?
Compared with DeiT/ViT in a supervised learning setting, it seems that your PatchEmbed does not have a norm layer. Can you explain this choice?
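For reference, a minimal sketch of the patch-embedding module under discussion; whether to append a norm layer after the projection (as timm's implementation optionally does via a norm_layer argument) is exactly the choice being asked about:

import torch.nn as nn

class PatchEmbed(nn.Module):
    # image -> non-overlapping patches -> linear projection, with no norm afterwards
    def __init__(self, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        return self.proj(x).flatten(2).transpose(1, 2)  # B x num_patches x embed_dim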
In this task, how do you propagate the label from previous frames?
Specifically,
Your work looks very interesting.
I'm not familiar with PyTorch/Python, and it would be great if the pre-trained nets could be provided in ONNX format.
Regards Armin
I used custom data to train DINO, and the model seems to have collapsed after a few steps; the features become uniform. I used a larger teacher temperature to enhance the "sharpening", but the model collapsed all the same. I wonder if DINO is sensitive to the data; in other words, does DINO tend to collapse when trained on different data?
I was wondering about the attention maps used for visualizing the supervised model. As far as I can understand, in the source code the attention weights of the last layer are used to visualize the saliency masks. Is the same approach used for visualizing the supervised model in the visual comparison in Figure 4 of the paper?
If so, papers such as Quantifying Attention Flow in Transformers argue that final attention maps can't be directly mapped back to input tokens in a meaningful way, since information is mixed during forward propagation through multiple self-attention blocks. What are your views on this?
Definitely, having saliency maps as a byproduct of self-supervised training is far more valuable than getting them from supervised training, in the sense of zero-shot learning. I was curious: if last-layer attention maps are used for the supervised visualizations, wouldn't it be fairer to use an approach like attention flow instead? Also, would using this approach give different results for ViTs trained with DINO? I am also not sure whether we can make sense of different heads with the attention-flow approach, and it's pretty cool to see that different heads are able to localize different objects in the case of DINO.
Thank you! :)
Hi all, I'm running into an error when trying to fine-tune from one of the pretrained checkpoints.
Code
!mkdir "$output"
!wget -q -O "$output/checkpoint.pth" https://dl.fbaipublicfiles.com/dino/dino_deitsmall16_pretrain/dino_deitsmall16_pretrain.pth
!python -m torch.distributed.launch \
--nproc_per_node=1 ./dino/main_dino.py \
--arch deit_small \
--data_path "$input" \
--output_dir "$output"
Error
| distributed init (rank 0): env://
git:
sha: 8aa93fdc90eae4b183c4e3c005174a9f634ecfbf, status: clean, branch: main
arch: deit_small
batch_size_per_gpu: 64
...
...
Student and Teacher are built: they are both deit_small network.
Loss, optimizer and schedulers ready.
Found checkpoint at ./drive/MyDrive/DINO/checkpoint.pth
=> failed to load student from checkpoint './drive/MyDrive/DINO/checkpoint.pth'
=> failed to load teacher from checkpoint './drive/MyDrive/DINO/checkpoint.pth'
=> failed to load optimizer from checkpoint './drive/MyDrive/DINO/checkpoint.pth'
=> failed to load fp16_scaler from checkpoint './drive/MyDrive/DINO/checkpoint.pth'
=> failed to load dino_loss from checkpoint './drive/MyDrive/DINO/checkpoint.pth'
Any suggestions would be very much appreciated.
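A quick diagnostic sketch; my guess (an assumption, not confirmed here) is that the *_pretrain.pth files hold only raw backbone weights, while main_dino.py's restart logic expects a full training checkpoint with top-level "student", "teacher", "optimizer", ... entries:

import torch

ckpt = torch.load("checkpoint.pth", map_location="cpu")
print(list(ckpt.keys())[:10])
# raw parameter names (e.g. 'cls_token', 'pos_embed', ...) would confirm this is
# a weights-only file rather than a full training checkpoint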
@mathildecaron31 I have a question about copy detection. I am trying to evaluate the pretrained DINO models on a dataset for the copy detection task, following the steps from the paper. Even with different image input sizes, in Table 4 we see that the final embedding dimension is 1536. I am not able to understand how we can get the same embedding dimension after concatenating the CLS embedding and the GeM-pooled output patch tokens for different input image sizes. Maybe I am missing a point here. Here is what I did:
Added the following method to VisionTransformer to return the output patch tokens and the CLS output.
def forward_output_patch_tokens_cls(self, x):
    B = x.shape[0]
    x = self.patch_embed(x)
    cls_tokens = self.cls_token.expand(B, -1, -1)
    x = torch.cat((cls_tokens, x), dim=1)
    pos_embed = self.interpolate_pos_encoding(x, self.pos_embed)
    x = x + pos_embed
    x = self.pos_drop(x)
    for blk in self.blocks:
        x = blk(x)
    if self.norm is not None:
        x = self.norm(x)
    return x  # B x (1 + num_patches) x embed_dim, CLS token first
Using GeM module from here
def gem(x, p=3, eps=1e-6):
    # x: BS x embed_dim x num_tokens (after the permute below); pools over token positions
    return F.avg_pool1d(x.clamp(min=eps).pow(p), x.size(-1)).pow(1. / p)

class GeM(nn.Module):
    def __init__(self, p=3, eps=1e-6):
        super(GeM, self).__init__()
        self.p = nn.Parameter(torch.ones(1) * p)
        self.eps = eps

    def forward(self, x):
        return gem(x, p=self.p, eps=self.eps)

    def __repr__(self):
        return self.__class__.__name__ + '(p={:.4f}, eps={})'.format(self.p.data.tolist()[0], self.eps)
Collect embeddings (CLS + GeM-pooled output patch tokens):
all_image_features = []
with torch.no_grad():
    for imgb in progress_bar(image_dl):
        outputs = model.forward_output_patch_tokens_cls(imgb.cuda())
        cls_token, output_patch_tokens = outputs[:, 0], outputs[:, 1:]
        cls_features = cls_token
        patch_features = gem_pooling(output_patch_tokens.permute(0, 2, 1)).squeeze(-1)
        concat_features = torch.cat([cls_features, patch_features], dim=-1)
        all_image_features.append(concat_features.cpu())
Following this and using an image size of 224 for dino_vitb8, my final embedding dimension is 1536 (I initially got 1568; see the edit below), which can also be calculated as:
cls_feature_dim * 2 = 768 * 2
Question
Also, during the copy detection task, do you learn the pooling parameter p, or is it picked based on a validation set? I didn't quite understand the whitening part: is it the same as regular unsupervised PCA?
I found this paper: https://hal.inria.fr/hal-00722622v2/document. I believe the idea comes from here.
Edit:
Figured out the 1536 dimension size. We need to pool across token positions, which gives a pooled embedding with the same dimension as the CLS token embedding.
Originally posted by @KeremTurgutlu in #8 (comment)
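A small shape check of the corrected pooling direction described in the edit above (assumptions: ViT-B/8 at 224x224, i.e. 784 patch tokens of dimension 768):

import torch
import torch.nn.functional as F

tokens = torch.randn(1, 784, 768)  # output patch tokens: B x num_tokens x embed_dim
x = tokens.permute(0, 2, 1)        # pool over token positions, not the embed dim
pooled = F.avg_pool1d(x.clamp(min=1e-6).pow(3), x.size(-1)).pow(1. / 3).squeeze(-1)
print(pooled.shape)  # torch.Size([1, 768]); concat with the 768-dim CLS -> 1536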
I noticed that the generation of the positional embedding in the interpolate_pos_encoding method is slightly different from the one in the forward_selfattention method. The following simple modification brings both onto the same page, for your interest.
def interpolate_pos_encoding(self, x, pos_embed, w, h):  # passing w and h as arguments
    npatch = x.shape[1] - 1
    N = pos_embed.shape[1] - 1
    if npatch == N:
        return pos_embed
    class_emb = pos_embed[:, 0]
    pos_embed = pos_embed[:, 1:]
    dim = x.shape[-1]
    w0 = w // self.patch_embed.patch_size  # just copy-paste from forward_selfattention
    h0 = h // self.patch_embed.patch_size
    pos_embed = nn.functional.interpolate(
        pos_embed.reshape(1, int(math.sqrt(N)), int(math.sqrt(N)), dim).permute(0, 3, 1, 2),
        # replace math.sqrt(npatch / N) with the per-axis factors from forward_selfattention
        scale_factor=(w0 / math.sqrt(N), h0 / math.sqrt(N)),
        mode='bicubic',
    )
    pos_embed = pos_embed.permute(0, 2, 3, 1).view(1, -1, dim)
    return torch.cat((class_emb.unsqueeze(0), pos_embed), dim=1)
Thanks for your excellent work!
In your paper you made a lot of comparisons with BYOL, MoCo and SwAV.
I would be grateful if you could release your implementation code for BYOL, MoCo and SwAV.
Couldn't they be represented as:
GaussianBlur = lambda p, radius_min=0.1, radius_max=2: T.RandomApply(
    [T.GaussianBlur(kernel_size=3, sigma=(radius_min, radius_max))], p=p)
Solarization = lambda p, threshold=0.5: T.RandomSolarize(threshold, p=p)
Thanks!
Dear Authors,
Thank you for this amazing work and repository. I wanted to test the clustering capability, on the CIFAR-10 dataset, of the network pretrained on ImageNet. To do so, I wanted to use the features you extract in the eval_knn.py script. Below are the parameters I used to run the code:
arch = deit_small (i.e., vit_small after the new commit)
patch_size = 16
batch_size_per_gpu = 32
data_path = 'path_to_cifar10_dataset'
All other params are set to their default values.
I left pretrained_weights blank so that the model loads the ImageNet weights from the URL mentioned in utils.py.
The script executed successfully with the message:
Data loaded with 50000 train and 10000 val imgs.
Model deit_small 16x16 built.
Please use the `--pretrained_weights` argument to indicate the path of the checkpoint to evaluate.
Since no pretrained weights have been provided, we load the reference pretrained DINO weights.
Pretrained weights found at dino_deitsmall16_pretrain/dino_deitsmall16_pretrain.pth and loaded with msg: <All keys matched successfully>
Extracting features for train set...
Storing features into tensor of shape torch.Size([50000, 384])
However, I'm facing a strange issue: the extracted features come out as a zero vector for every input image.
torch.Size([50000, 384])
torch.Size([50000])
tensor([[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]], device='cuda:0')
Can you please specify where I might be going wrong?
Thanks in advance!
I'm wondering how long it takes to run the linear evaluation with the default settings.
This is impressive work, and thanks for your code. Regarding the results in the table of Figure 4, I want to ask how the multi-label segmentation maps are generated, i.e., how the self-attention maps are associated with different classes.