
cmc's Issues

Why use learning rate 30~50

I saw your note and it seems rather unusual to use such a large learning rate:

Note: When training linear classifiers on top of ResNets, it's important to use large learning rate, e.g., 30~50.

Is there something I'm missing? I can't imagine how you get stable gradient descent with such high learning rates.

Pre-trained weights for linear classifier available?

Hey there, thanks for the well-documented code!

Quick question: Am I correctly assuming that in order to evaluate the model on the 1,000-class ImageNet validation dataset one has to train the linear classifier first (using LinearProbing.py)? If so, would it be possible to release pre-trained weights for the classifier as well, such that one can use classifier.load_state_dict(checkpoint['classifier'])?

shuffle-bn has no effect on single-GPU

It appears to me that shuffle-bn has no effect when run on a single GPU.

Example:

import torch
import torch.nn as nn

(B, C, H, W) = 4, 3, 2, 2

# Two identical BN layers; model2 will see a shuffled copy of model1's batch.
model1 = nn.Sequential(nn.BatchNorm2d(C))
model2 = nn.Sequential(nn.BatchNorm2d(C))
print("Before:")
print("  model1 stats: ", model1[0].running_mean, model1[0].running_var)
print("  model2 stats: ", model2[0].running_mean, model2[0].running_var)

shuffle_ids = torch.randperm(B).long()
x1 = torch.randn(B, C, H, W) * 3 + 1
x2 = x1[shuffle_ids]  # same samples, different order within the batch

# BN statistics are computed over the whole batch, so a permutation
# within the batch leaves them unchanged:
model1(x1)
model2(x2)
print("After:")
print("  model1 stats: ", model1[0].running_mean, model1[0].running_var)
print("  model2 stats: ", model2[0].running_mean, model2[0].running_var)
Before:
  model1 stats:  tensor([0., 0., 0.]) tensor([1., 1., 1.])
  model2 stats:  tensor([0., 0., 0.]) tensor([1., 1., 1.])
After:
  model1 stats:  tensor([0.2285, 0.1523, 0.1447]) tensor([1.6193, 1.4863, 1.6332])
  model2 stats:  tensor([0.2285, 0.1523, 0.1447]) tensor([1.6193, 1.4863, 1.6332])

I guess another approach is necessary on a single GPU. Any thoughts?

Thanks for releasing this code.
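
For reference, one workaround seen in single-GPU reimplementations of MoCo is to compute BN statistics over sub-batches, so that shuffling actually changes which samples share statistics. A minimal sketch (SplitBatchNorm2d and num_splits are illustrative names, not part of this repo; note that the running stats are updated once per chunk):

import torch
import torch.nn as nn

class SplitBatchNorm2d(nn.BatchNorm2d):
    """Normalize each of num_splits chunks of the batch with its own
    statistics, emulating the per-GPU statistics that shuffle-BN relies on."""
    def __init__(self, num_features, num_splits=2, **kwargs):
        super().__init__(num_features, **kwargs)
        self.num_splits = num_splits

    def forward(self, x):
        if not self.training:
            return super().forward(x)
        # Each chunk is normalized independently; combined with a batch
        # shuffle, different samples end up sharing statistics.
        chunks = x.chunk(self.num_splits, dim=0)
        outs = [super(SplitBatchNorm2d, self).forward(c) for c in chunks]
        return torch.cat(outs, dim=0)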

A question about the dataset.

Dear authors,

I recently read your paper, and I think it is really interesting and significant.
So I want to look into the details of the method by running the code.

I mainly focus on graph representation learning, recommendation, and ML.
I am not familiar with the image dataset and processing.

Could you point me to the datasets needed to run the code? Or where can I download the ImageNet-100 and STL-10 datasets?

Thanks.

Xu Chen

Is it possible to train CMC by loading a pretrained ResNet-50 model on ImageNet?

I enjoyed reading the paper. Thanks for open sourcing the code.

Please let me know whether I can train the CMC model (ResNet-50 variant) by loading a ResNet-50 pretrained on ImageNet.

Also, if I want to train on a custom dataset with a custom number of classes, what changes to the hyperparameters would you suggest?

Code to reproduce NYU RGBD results / input pipeline

Hi,
thanks for your repo.
It would be nice if you could provide the code / input pipeline you used to run the NYU RGB-D experiments as well (similar to #4 ). It is not entirely clear to me how you added the different modalities.
Best,

TenCrop Results

Are there any ten-crop results? As far as I know, some methods improve a lot with ten-crop evaluation while others improve only a little. I wonder how much improvement ten-crop yields for CMC.

Unable to reproduce full ImageNet accuracies of pretrained weights for CMC ResNet50v2 and MoCo

Hi @HobbitLong,

Thanks for such clean and readable code.

I am interested in using the pre-trained weights that you were kind enough to provide. I downloaded CMC_resnet50v2.pth and MoCo_softmax_16384_epoch200.pth, then ran the linear evaluation code with the following commands, but couldn't reproduce the accuracies. The accuracies at the final (60th) epoch are 62.0% for CMC and 57.3% for MoCo; they should be 64.1% (from the CMC paper) and 59.4% (from the README).

CUDA_VISIBLE_DEVICES=9 python LinearProbing.py --dataset imagenet \
 --data_folder /datasets/imagenet_nfs1 \
 --save_path ./output/cmc_linear \
 --tb_path ./output/cmc_linear \
 --model_path ./pretrained/CMC_resnet50v2.pth \
 --model resnet50v2 --learning_rate 30 --layer 6

CUDA_VISIBLE_DEVICES=8 python eval_moco_ins.py --dataset imagenet \
 --data_folder /datasets/imagenet_nfs1 \
 --save_path ./output/moco_linear \
 --tb_path ./output/moco_linear \
 --model_path ./pretrained/MoCo_softmax_16384_epoch200.pth \
 --model resnet50 --learning_rate 30 --layer 6

Have I missed something? Do I need to change the default hyperparameters to get the reported numbers?

Thanks

For loops in AliasMethod

https://github.com/HobbitLong/CMC/blob/master/NCE/alias_multinomial.py#L8

Hi!

While reading your code, I noticed that the for loops in the initialization function of AliasMethod cause a lot of computation.

However, the only call site (https://github.com/HobbitLong/CMC/blob/master/NCE/NCEAverage.py#L13) instantiating the class passes torch.ones, which results in self.prob being all ones and self.alias all zeros in AliasMethod.

What could go wrong if I just set them to ones and zeros directly, instead of running the for loops, when initializing AliasMethod?

Thanks for sharing the code :) (and RepDistiller too!)
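
To make the question concrete: assuming the input really is torch.ones, the alias tables degenerate, as this minimal check suggests (the variable names are mine):

import torch

K = 16384
probs = torch.ones(K)  # what NCEAverage passes in

# Walker's alias method stores K * p_i in prob; for a uniform distribution
# K * p_i == 1 everywhere, so prob is all ones and alias is never consulted.
prob = probs * K / probs.sum()            # == torch.ones(K)
alias = torch.zeros(K, dtype=torch.long)  # arbitrary, unused when prob == 1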

Implementing CMC on CIFAR-10

Hi @HobbitLong, I am trying to implement CMC on CIFAR-10 with a shallow ResNet. However, the accuracy only reaches 60%~70%. I have tried tuning the batch size from 64 to 512 and the learning rate from 0.01 to 0.12. In addition, I also tuned nce_k from 8192 to 65536. Unfortunately, the accuracy has not improved. Do you have any suggestions for tuning parameters on small datasets like CIFAR-10? Thank you very much.

Something went wrong when evaluating the results on ImageNet

When I evaluated the result on ImageNet (not the subset), I got the following error:

THCudaCheckWarn FAIL file=/opt/conda/conda-bld/pytorch_1524586445097/work/aten/src/THC/THCStream.cpp line=50 error=59 : device-side assert triggered

Does anyone have any thoughts about this issue?

ImageNet100 subset

Hi,
Would you please share the subset of ImageNet (ImageNet-100) you used?
I want to train the MoCo model and compare it with your results!
Thanks!

Curious about the RandomResizedCrop parameters (minimum crop scale in your code)

Hi, thank you for sharing the code! I am curious about the effect of the data augmentation, specifically the RandomResizedCrop in train_moco_ins.py.
In your code, the minimum crop scale is 0.2 for most configurations but 0.08 (the torchvision default) for the full ImageNet dataset with ResNet. However, other papers such as non-parametric instance discrimination also set it to 0.2 when using a ResNet backbone, so I am curious about the choice of 0.08. Does the smaller minimum scale work better on full ImageNet? Have you compared 0.08 against 0.2 with a ResNet backbone?
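
For reference, the two minimum-scale settings being compared, as a torchvision sketch (values from the question, not this repo's exact pipeline):

from torchvision import transforms

# scale[0] = 0.08 is the torchvision default; 0.2 is what the
# instance-discrimination line of work typically uses.
crop_default = transforms.RandomResizedCrop(224, scale=(0.08, 1.0))
crop_larger_min = transforms.RandomResizedCrop(224, scale=(0.2, 1.0))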

Reproducing MoCo on ImageNet-1k.

Hi Yonglong,

Thanks a lot for the great work and for sharing the code. I am trying to reproduce the results of MoCo on ImageNet-1k with ResNet-50. Did you reproduce the results from Kaiming's paper on full ImageNet? Would you kindly share the specific configuration for reproducing MoCo-ResNet-50?

Thanks a lot!

Support for resnet as backbone

Hi, thanks for open-sourcing the code. I wanted to know when you will enable support for ResNet as a backbone.

Other available views?

Could you please release the code for using views other than "L" and "ab" in the CMC training process?

Code to recreate STL10 experiments

I enjoyed reading the paper and thanks for uploading the code.

Quick question - would it be possible to also upload the scripts to run the STL-10 eval?

Thanks!

Support for DistributedDataParallel

Hi

Thank you for sharing this great work with us.

I saw that you use spawn in the code, so I am wondering about your plans to release code supporting DistributedDataParallel. In particular, I am curious how you would sync the memory banks for L and ab, e.g., self.register_buffer('memory_ab'), during training.

Thank you :)

ImageNet100

Hi,
I'm confused about how to get the ImageNet-100 dataset online, since I can't find any download link.
Could you please share the link?

Thank you.

Convention for number of convolutions in AlexNet

Hi there,

This is a bit of a meta-question.

I noticed that your code uses the original AlexNet parameters, i.e., convolutions with 96, 256, 384, 384, 256 channels, versus the "one weird trick" paper's 64, 192, 384, 256, 256, which is the standard in the official PyTorch implementation.

In comparison, Feng et al. at CVPR 2019 use the smaller version of AlexNet in their code.

I was wondering whether there is a standard for which version of AlexNet should be used in the self-supervised literature, and whether it even makes a difference.

Thanks

Why use memory feature for positive samples?

Hi! Thanks for your code!

I have some questions about your implementation. I notice that negative samples come from the memory bank because 4,096 negatives are too many for one batch. But for positive samples, why still use the memory bank rather than the features computed in the current batch? Is there any harm in doing this?

Thanks

During the training, the loss and probs seem bad.

After 126 epochs of training, the loss still seems huge, and the probs for "L" and "ab" are only about 0.007. We set the learning rate to 6e-2 and the batch size to 1024 (8 Tesla V100s).

Train: [126][930/1252] BT 0.827 (0.953) DT 0.001 (0.234) loss 6.161 (6.071) l_p 0.007 (0.007) ab_p 0.006 (0.006)
torch.Size([1024, 16385, 1])
Train: [126][940/1252] BT 0.630 (0.951) DT 0.001 (0.232) loss 5.945 (6.071) l_p 0.007 (0.007) ab_p 0.006 (0.006)
torch.Size([1024, 16385, 1])

I don't know what's wrong with our experiment setting. Could you share the curves of training loss and probs of 'L' and 'ab'?

ImageNet-100

Hi @HobbitLong, could you please supply the classes you use for the ImageNet-100 dataset? Thanks. Is ImageNet-100 the same as ImageNet except for the number of classes?

High Values of Z_L and Z_ab

Hi,
Thanks for open-sourcing your work. I have been trying to use CMC on my custom toy dataset, which has two views (image (3D) and sensor view (3D)). I am able to run the model successfully, but the Z for view 1 and view 2 is being set to 119973150195712.

I made sure to L2-normalize the final features from each of the AlexNet halves, but I'm really not sure why the Z values are being initialized to such a high value. I kept nce_m, nce_k, and nce_t the same as in your code.

Can you please help me with this? Thank you.

Question about data augmentation and memory bank

Hi, Thanks a lot for sharing this great code.
I have a question about data augmentation and the memory bank. If we use data augmentation, the features stored in the memory bank are not updated to match the current augmentations, especially for the positive examples we draw from the memory bank.
Have you thought about it?

Expected values of `ins_prob` and `ins_loss` in MoCo when training is working

Hi there

Thanks a lot for this great repo!

I am trying out MoCo on my own dataset (I also added additional augmentations). Training appears to have converged, but the max value I get for ins_prob is about 13.35, and the lowest value I get for loss is about 0.2422.

I am wondering what metrics you got when training on ImageNet? I am not sure what a "good" score should look like.

Here are screenshots of the training progress in TensorBoard (ignore the multiple lines at the start of training).

[two TensorBoard screenshots omitted]

Thanks,
Liam

Accuracy on ImageNet using Resnet50v1

Hi,
Thanks for the released code. I want to check something that has been puzzling me.
Does 'resnet50v2' represent 'ResNet-50' in Table 2 of the paper?
Does 'resnet50v3' represent 'ResNet-50 x2' in Table 2 of the paper?
If so, I would like to know whether you have trained 'resnet50v1' on ImageNet. Could you please share the results?

Thanks.

Questions about NCEAverage.py

Hi @HobbitLong , thank you for releasing the code. I wanted to ask a few questions regarding the implementation of NCEAverage.py. I understand some of them might be pretty basic questions but hopefully the answers will also help others to understand the code + implementation better.

  • What is the purpose of T=0.07, and why do out_l and out_ab need to be divided by T?

  • Is there any advantage to starting out with unit vectors (on average) via stdv = 1. / math.sqrt(inputSize / 3) here? I ask because out_l and out_ab need to be normalized anyway, as is done here.

  • Is it correct that you use a moving average (MA) to update weight_l and weight_ab (instead of just copying the values directly) because the model itself is learning and the values l and ab can be noisy? Using an MA reduces variance.

  • As a follow-up, how would this implementation be possible if you were not using memory banks? Is this an incidental advantage of using a memory bank?

  • [Resolved] Why did you not use a gradient-descent-based method to implement NCE? Was it done to reduce the overhead of everything that needed to be learned?

  • [Resolved] Lastly, since NCEAverage has no parameters or nn layers, I believe you don't need with torch.no_grad() here.

Thank you again.
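
To make the first question concrete, here is a small standalone illustration of what dividing by T does (made-up similarity values, not from the repo):

import torch

# Dividing similarities by a small temperature sharpens the softmax,
# so the positive (index 0) dominates the contrastive objective.
sims = torch.tensor([0.9, 0.1, 0.05])
print(torch.softmax(sims, dim=0))         # ~[0.53, 0.24, 0.23], nearly flat
print(torch.softmax(sims / 0.07, dim=0))  # ~[1.00, 0.00, 0.00], sharply peaked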

Reproduce MoCov2 on ImageNet 1k

Hi @HobbitLong, I am trying to reproduce MoCo v2 on ImageNet-1k. Have you tried replacing the linear projection head with an MLP? Do you think it is necessary to add a batch normalization layer or a bias to the fully connected layer? I keep all the hyperparameters the same as in the paper but could only get ~61.4% accuracy with 4 GPUs and a batch size of 256.

Would you kindly share with me the specific configurations based on your codebase for reproducing MoCov2-ResNet-50?

Thanks a lot!
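
For context, the projection head described in the MoCo v2 paper is a two-layer MLP with a ReLU in between; a sketch with the standard ResNet-50 dimensions (this follows the paper, not this repo's code):

import torch.nn as nn

# 2048-d backbone features -> 2048-d hidden -> 128-d embedding,
# replacing the single linear head of MoCo v1.
mlp_head = nn.Sequential(
    nn.Linear(2048, 2048),
    nn.ReLU(inplace=True),
    nn.Linear(2048, 128),
)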

Using dot product as a proxy for probability in NCEAverage

Hi, it seems that you are using the dot product between vectors from the two views as a proxy for the unknown distribution denoted pd in your paper. In other words, your hθ is the dot product. Theoretically any hθ can work, so that's fine.

But doesn't it force the two representations to be similar? I understand the two representations should have high mutual information, but that is not the same as the two vectors pointing in similar directions.

Obviously it worked out pretty well, but do you think a parameterized NCEAverage loss would have allowed representations with less similar directions but still high MI?

Thank you again!
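
For reference, my reading of the critic from the paper, with f_{\theta_1}, f_{\theta_2} the two encoders and \tau the temperature:

$$ h_\theta(\{v_1, v_2\}) = \exp\left( \frac{f_{\theta_1}(v_1) \cdot f_{\theta_2}(v_2)}{\lVert f_{\theta_1}(v_1) \rVert \, \lVert f_{\theta_2}(v_2) \rVert} \cdot \frac{1}{\tau} \right) $$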

Question about softmax loss

CMC/NCE/NCECriterion.py

Lines 35 to 46 in 58d06e9

class NCESoftmaxLoss(nn.Module):
    """Softmax cross-entropy loss (a.k.a., info-NCE loss in CPC paper)"""
    def __init__(self):
        super(NCESoftmaxLoss, self).__init__()
        self.criterion = nn.CrossEntropyLoss()

    def forward(self, x):
        bsz = x.shape[0]
        x = x.squeeze()
        label = torch.zeros([bsz]).cuda().long()
        loss = self.criterion(x, label)
        return loss

Hi, I have a question about using the softmax loss instead of the NCE loss.
In that function, every label is set to zero, including for the critic value of the positive sample, which sits at index 0 of the batch.
I want to know the reason. My take is that the label should be [1, 0, 0, 0, ...]. Shouldn't it?
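
One PyTorch detail that may be relevant when reading this: nn.CrossEntropyLoss takes class indices rather than one-hot vectors, so a target of 0 encodes "the correct class sits at index 0", which is where the positive's critic value is placed. A minimal demo (made-up logits):

import torch
import torch.nn as nn

logits = torch.tensor([[5.0, 1.0, 0.5, 0.2]])  # positive logit first
target = torch.tensor([0])                     # an index, not a one-hot vector
print(nn.CrossEntropyLoss()(logits, target))   # small: the positive already dominates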

Could You Please Share the Curve of Training Loss?

Hi,
I want to use CMC in my own experiment, but the loss is strange. At each epoch, the loss decays as normal (like from 20 to 11). But at the next epoch, the loss becomes nearly the same as begining (the loss is 20 again). I wonder if it is 'normal' in CMC.

Thanks.

The loss label of NCESoftmaxLoss in NCECriterion.py?

Hi,

I see your code for NCESoftmaxLoss as follows:

#########
class NCESoftmaxLoss(nn.Module):
    """Softmax cross-entropy loss (a.k.a., info-NCE loss in CPC paper)"""
    def __init__(self):
        super(NCESoftmaxLoss, self).__init__()
        self.criterion = nn.CrossEntropyLoss()

    def forward(self, x):
        bsz = x.shape[0]
        x = x.squeeze()
        label = torch.zeros([bsz]).cuda().long()
        loss = self.criterion(x, label)
        return loss

###########
The label for this loss is label = torch.zeros([bsz]).cuda().long(), but in your paper, according to Eq. 2 [equation image omitted], you have one positive for each sample.

So is something missing here?

Thanks.

Question about NCECriterion. py

loss = - (log_D1.sum(0) + log_D0.view(-1, 1).sum(0)) / bsz

I think the purpose of using NCE is to avoid the expensive summation over the entire vocabulary in the softmax. But in your implementation there is still a summation over the entire log_D0, which confuses me. I'd appreciate it if you could explain this.
I'm new to this field, so please point out my misunderstanding if there is one.
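
For concreteness, my reconstruction of what that line computes, with N the batch size and m the number of noise samples per instance (so the second sum runs over the m noise samples, not the full dictionary):

$$ \mathcal{L}_{\mathrm{NCE}} = -\frac{1}{N} \sum_{i=1}^{N} \left[ \log D_1(x_i) + \sum_{j=1}^{m} \log D_0(x'_{i,j}) \right] $$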

Unable to run pre-trained models

Hello,

Thanks for making this code available. I am trying to run the pretrained AlexNet model (downloaded from the Dropbox link) with the following command:

python LinearProbing.py --dataset imagenet --data_folder /share/ctn/users/jwl2182/imagenet_data --save_path . --model_path /home/jwl2182/CMC/CMC_alexnet.pth --model alexnet --learning_rate 0.1 --layer 5 --tb_path /home/jwl2182/CMC/tb --gpu 0

But I get the following error. Any ideas what might be happening?

RuntimeError: Error(s) in loading state_dict for MyAlexNetCMC:
Unexpected key(s) in state_dict: "encoder.module.l_to_ab.conv_block_1.1.num_batches_tracked", "encoder.module.l_to_ab.conv_block_2.1.num_batches_tracked", "encoder.module.l_to_ab.conv_block_3.1.num_batches_tracked", "encoder.module.l_to_ab.conv_block_4.1.num_batches_tracked", "encoder.module.l_to_ab.conv_block_5.1.num_batches_tracked", "encoder.module.l_to_ab.fc6.1.num_batches_tracked", "encoder.module.l_to_ab.fc7.1.num_batches_tracked", "encoder.module.ab_to_l.conv_block_1.1.num_batches_tracked", "encoder.module.ab_to_l.conv_block_2.1.num_batches_tracked", "encoder.module.ab_to_l.conv_block_3.1.num_batches_tracked", "encoder.module.ab_to_l.conv_block_4.1.num_batches_tracked", "encoder.module.ab_to_l.conv_block_5.1.num_batches_tracked", "encoder.module.ab_to_l.fc6.1.num_batches_tracked", "encoder.module.ab_to_l.fc7.1.num_batches_tracked".
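
A workaround I would try (hypothetical sketch, assuming the only mismatch is the num_batches_tracked buffers that newer PyTorch versions add to BatchNorm; model stands for the MyAlexNetCMC instance built by LinearProbing.py):

import torch

# Load on CPU and drop the offending buffers before loading the weights.
checkpoint = torch.load('CMC_alexnet.pth', map_location='cpu')
weights = checkpoint.get('model', checkpoint)  # the key name may differ
weights = {k: v for k, v in weights.items()
           if not k.endswith('num_batches_tracked')}
model.load_state_dict(weights)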

Augmenting images at the evaluation of downstream classification task

CMC/eval_moco_ins.py

Lines 372 to 397 in 58d06e9

for idx, (input, target) in enumerate(train_loader):
    # measure data loading time
    data_time.update(time.time() - end)

    if opt.gpu is not None:
        input = input.cuda(opt.gpu, non_blocking=True)
    input = input.float()
    target = target.cuda(opt.gpu, non_blocking=True)

    # ===================forward=====================
    with torch.no_grad():
        feat = model(input, opt.layer)
        feat = feat.detach()

    output = classifier(feat)
    loss = criterion(output, target)

    acc1, acc5 = accuracy(output, target, topk=(1, 5))
    losses.update(loss.item(), input.size(0))
    top1.update(acc1[0], input.size(0))
    top5.update(acc5[0], input.size(0))

    # ===================backward=====================
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

If I understand the inner workings of eval_moco_ins.py correctly, the code trains the downstream task (a single FC layer) on augmented images (train_transform == 'CJ').

This augmentation not only slows down training of the downstream task but also seems to violate the purpose of the evaluation ("we freeze the features and train a supervised linear classifier", as stated in the MoCo paper).

Wouldn't it be more correct to save the center-cropped, average-pooled features and train the FC layer on those fixed features?
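
For reference, the deterministic pipeline this suggestion implies (the usual torchvision ImageNet evaluation transform; the values shown are the standard ones, not necessarily this repo's):

from torchvision import transforms

# Resize, center-crop, and normalize: no randomness, so cached features
# would stay fixed across epochs.
eval_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])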
