facebookresearch / moco
PyTorch implementation of MoCo: https://arxiv.org/abs/1911.05722
Thanks!
Hi,
Thanks for releasing the code! I think the launch command in the README should be updated a bit. I ran it like this:
python main_moco.py -a resnet50 --lr 0.03 --batch-size 256 --world-size 1 --rank 0 /data2/zzy/imagenet
And I got this error:
Traceback (most recent call last):
File "main_moco.py", line 402, in <module>
main()
File "main_moco.py", line 133, in main
main_worker(args.gpu, ngpus_per_node, args)
File "main_moco.py", line 186, in main_worker
raise NotImplementedError("Only DistributedDataParallel is supported.")
NotImplementedError: Only DistributedDataParallel is supported.
I think the rank is not being assigned correctly. Did I miss anything?
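For reference, a launch along the lines of the README (my understanding is that --multiprocessing-distributed is what enables the DDP path in main_worker):

python main_moco.py -a resnet50 --lr 0.03 --batch-size 256 \
  --dist-url 'tcp://localhost:10001' --multiprocessing-distributed --world-size 1 --rank 0 \
  /data2/zzy/imagenet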
Typically, we use SyncBN in DDP to ensure that the computed gradients are identical across different GPUs; it keeps the models on different GPUs with exactly the same parameters during training.
However, in MoCo training (IN-1M), the encoders contain several vanilla BN layers. How is it ensured that the models across GPUs keep the same parameters? Thanks.
I notice the final R@1 after 200 epochs in the README is 60%, but there is no code in the repo to evaluate the model, only training accuracy.
Can you explain how to evaluate performance during training?
Hi, thanks for the amazing code!
When I tried to load a pre-trained checkpoint for object detection, this error happened:
"ValueError: Unsupported type found in checkpoint! model: <class 'dict'>"
I can resolve this error if I save the state_dict directly in the checkpoint without using a "model" key, but then the "running_mean" and "running_var" of the batch norm layers are not loaded into the detector. I guess it has something to do with "matching_heuristics".
Thanks, and looking forward to your reply!
Hi,
Thank you for open-sourcing this simple and clear repo!
I have tried to reproduce the R50-FPN results on COCO, and am curious about the normalization setting. I have created a config file here. Would you mind taking a look to see whether there is any difference between my settings and yours?
Thank you!
According to your results, 200 epochs of pre-training (ResNet-50 baseline) takes 53 hours on an 8x V100 machine, but training on my 8x V100 machine is three to four times slower than that. I don't know why; maybe the environment configuration is different. Could you release your environment configuration? Thanks!
This is my pre-training log: about 0.6 s per batch, roughly 3000 s (about 1 h) per epoch.
2020-07-16T09:20:06.867Z: [1,0]<stdout>:Epoch: [16][4000/5004] Time 1.300 ( 0.685) Data 0.000 ( 0.084) Loss 1.0633e+00 (1.2471e+00) Acc@1 100.00 ( 95.40) Acc@5 100.00 ( 97.76)
2020-07-16T09:20:12.016Z: [1,0]<stdout>:Epoch: [16][4010/5004] Time 0.309 ( 0.685) Data 0.000 ( 0.084) Loss 1.4829e+00 (1.2472e+00) Acc@1 87.50 ( 95.40) Acc@5 93.75 ( 97.76)
2020-07-16T09:20:18.283Z: [1,0]<stdout>:Epoch: [16][4020/5004] Time 1.043 ( 0.685) Data 0.000 ( 0.084) Loss 1.1532e+00 (1.2472e+00) Acc@1 96.88 ( 95.40) Acc@5 96.88 ( 97.75)
2020-07-16T09:20:24.301Z: [1,0]<stdout>:Epoch: [16][4030/5004] Time 0.271 ( 0.685) Data 0.000 ( 0.084) Loss 1.1201e+00 (1.2469e+00) Acc@1 96.88 ( 95.40) Acc@5 100.00 ( 97.75)
2020-07-16T09:20:30.259Z: [1,0]<stdout>:Epoch: [16][4040/5004] Time 0.413 ( 0.684) Data 0.000 ( 0.083) Loss 1.4439e+00 (1.2468e+00) Acc@1 90.62 ( 95.40) Acc@5 93.75 ( 97.75)
2020-07-16T09:20:36.487Z: [1,0]<stdout>:Epoch: [16][4050/5004] Time 0.213 ( 0.684) Data 0.000 ( 0.083) Loss 1.1293e+00 (1.2468e+00) Acc@1 93.75 ( 95.40) Acc@5 100.00 ( 97.76)
2020-07-16T09:20:42.951Z: [1,0]<stdout>:Epoch: [16][4060/5004] Time 0.232 ( 0.684) Data 0.000 ( 0.083) Loss 1.1727e+00 (1.2470e+00) Acc@1 100.00 ( 95.40) Acc@5 100.00 ( 97.75)
2020-07-16T09:20:48.433Z: [1,0]<stdout>:Epoch: [16][4070/5004] Time 0.260 ( 0.684) Data 0.000 ( 0.083) Loss 1.3516e+00 (1.2469e+00) Acc@1 96.88 ( 95.40) Acc@5 96.88 ( 97.75)
2020-07-16T09:20:54.556Z: [1,0]<stdout>:Epoch: [16][4080/5004] Time 0.271 ( 0.684) Data 0.000 ( 0.083) Loss 1.0669e+00 (1.2469e+00) Acc@1 96.88 ( 95.40) Acc@5 100.00 ( 97.76)
2020-07-16T09:21:01.362Z: [1,0]<stdout>:Epoch: [16][4090/5004] Time 0.914 ( 0.684) Data 0.000 ( 0.082) Loss 1.3178e+00 (1.2468e+00) Acc@1 90.62 ( 95.40) Acc@5 96.88 ( 97.75)
2020-07-16T09:21:07.425Z: [1,0]<stdout>:Epoch: [16][4100/5004] Time 0.215 ( 0.683) Data 0.000 ( 0.082) Loss 9.2172e-01 (1.2467e+00) Acc@1 100.00 ( 95.40) Acc@5 100.00 ( 97.75)
2020-07-16T09:21:14.707Z: [1,0]<stdout>:Epoch: [16][4110/5004] Time 0.359 ( 0.684) Data 0.000 ( 0.082) Loss 1.3362e+00 (1.2468e+00) Acc@1 96.88 ( 95.40) Acc@5 96.88 ( 97.75)
➜ 2020-7-16 nvidia-smi
Thu Jul 16 09:41:17 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.00 Driver Version: 418.87.00 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-PCIE... Off | 00000000:05:00.0 Off | 0 |
| N/A 54C P0 181W / 250W | 4802MiB / 32480MiB | 95% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-PCIE... Off | 00000000:08:00.0 Off | 0 |
| N/A 56C P0 109W / 250W | 4810MiB / 32480MiB | 97% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-PCIE... Off | 00000000:0D:00.0 Off | 0 |
| N/A 42C P0 176W / 250W | 4808MiB / 32480MiB | 95% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-PCIE... Off | 00000000:13:00.0 Off | 0 |
| N/A 43C P0 172W / 250W | 4810MiB / 32480MiB | 95% Default |
+-------------------------------+----------------------+----------------------+
| 4 Tesla V100-PCIE... Off | 00000000:83:00.0 Off | 0 |
| N/A 56C P0 197W / 250W | 4804MiB / 32480MiB | 95% Default |
+-------------------------------+----------------------+----------------------+
| 5 Tesla V100-PCIE... Off | 00000000:89:00.0 Off | 0 |
| N/A 58C P0 168W / 250W | 4810MiB / 32480MiB | 96% Default |
+-------------------------------+----------------------+----------------------+
| 6 Tesla V100-PCIE... Off | 00000000:8E:00.0 Off | 0 |
| N/A 43C P0 64W / 250W | 4810MiB / 32480MiB | 95% Default |
+-------------------------------+----------------------+----------------------+
| 7 Tesla V100-PCIE... Off | 00000000:91:00.0 Off | 0 |
| N/A 42C P0 157W / 250W | 4808MiB / 32480MiB | 95% Default |
+-------------------------------+----------------------+----------------------+
It seems this problem was caused by the PyTorch version. This is my environment:
pytorch1.3.1-py36-cuda10.0-cudnn7.0
Hi,
I found a note:
"Note: for 4-gpu training, we recommend following the linear lr scaling recipe: --lr 0.015 --batch-size 128
with 4 gpus. We got similar results using this setting."
If my GPUs have enough memory so that each GPU can handle a batch size of 64, is it fine to use the original recipe --lr 0.03 --batch-size 256?
Or is there a reason you recommend (batch size) / (# gpus) = 32?
Hi! Thanks for this great code repo. Would it be possible to make available the pre-trained models for ResNet-50 2x width and 4x width? These models were used in the original MoCo paper, but training such wide models requires a lot of resources.
On your side, how long does unsupervised pre-training of a ResNet-50 model on ImageNet take on an 8-GPU machine for 800 epochs?
Thank you!
Thank you for providing such clear and easy-to-follow code for this great project! I was just curious about line 146 in builder.py:
l_neg = torch.einsum('nc,ck->nk', [q, self.queue.clone().detach()])
Is it necessary to make a copy of the queue at all? Does this introduce unnecessary overhead? Or am I misunderstanding something?
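For context, a toy sketch of the surrounding logits computation as I read it in builder.py (stand-in tensors, not the repo's module); as far as I can tell, the clone().detach() only makes explicit that no gradient flows back into the queue buffer:

import torch
import torch.nn.functional as F

N, C, K = 4, 128, 1024                       # batch size, feature dim, queue length
q = F.normalize(torch.randn(N, C), dim=1)    # query features
k = F.normalize(torch.randn(N, C), dim=1)    # key features
queue = F.normalize(torch.randn(C, K), dim=0)

l_pos = torch.einsum('nc,nc->n', [q, k]).unsqueeze(-1)          # N x 1 positive logits
l_neg = torch.einsum('nc,ck->nk', [q, queue.clone().detach()])  # N x K negative logits
logits = torch.cat([l_pos, l_neg], dim=1) / 0.07                # temperature T = 0.07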
I am new to PyTorch DistributedDataParallel (DDP), and I am not clear about the shuffle-BN process.
In the code, you first call concat_all_gather(), and then broadcast a random index permutation to every device from src=0.
Here is my question:
Does only device 0 broadcast? Do the other devices also run _batch_shuffle_ddp()?
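To check my understanding, here is a single-process toy simulation of the shuffle (my paraphrase, not the repo's code): every rank executes the same collective calls, but because the permutation is broadcast from src=0, all ranks end up using rank 0's randperm; each rank then takes its own slice of the gathered batch.

import torch

# Pretend a global batch is split across 4 "GPUs" of 8 samples each.
num_gpus, per_gpu = 4, 8
global_batch = torch.randn(num_gpus * per_gpu, 3)

# Rank 0 draws one permutation; the broadcast means every rank uses this same permutation.
idx_shuffle = torch.randperm(global_batch.shape[0])
idx_unshuffle = torch.argsort(idx_shuffle)

# Each rank slices out its own chunk of the shuffled global batch.
shards = [global_batch[idx_shuffle.view(num_gpus, -1)[r]] for r in range(num_gpus)]

# After the key encoder runs, the gathered outputs are restored to the original order.
restored = torch.cat(shards, dim=0)[idx_unshuffle]
assert torch.allclose(restored, global_batch)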
Just a nit, but print statements are used throughout the code rather than configurable loggers. Since the print builtin is overridden on processes other than process 0, this can be surprising to developers. Consider using python's logging module to make this configuration more standardized and clear.
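A minimal sketch of what I have in mind (setup_logger is a hypothetical helper, not part of the repo): configure a per-rank logger so only rank 0 reports at INFO, instead of overriding the print builtin.

import logging

def setup_logger(rank):
    # Rank 0 logs at INFO; other ranks stay quiet unless something goes wrong.
    logging.basicConfig(
        format=f"[rank {rank}] %(asctime)s %(levelname)s: %(message)s",
        level=logging.INFO if rank == 0 else logging.WARNING,
    )
    return logging.getLogger("moco")

logger = setup_logger(rank=0)
logger.info("Epoch: [0][0/5004]  Loss 10.2")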
I am using macOS and I downloaded the pre-trained model with curl -OL https://dl.fbaipublicfiles.com/moco/moco_checkpoints/moco_v2_800ep/moco_v2_800ep_pretrain.pth.tar. I tried to decompress the file with tar xvf moco_v2_800ep_pretrain.pth.tar but received the error tar: Error opening archive: Unrecognized archive format. Could you verify that the pre-trained model files are not corrupted? Thank you very much.
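In case it helps: despite the .pth.tar suffix, the file appears to be a plain PyTorch checkpoint (that is how main_lincls.py consumes it), so it should be loaded with torch.load rather than extracted with tar, e.g.:

import torch

ckpt = torch.load('moco_v2_800ep_pretrain.pth.tar', map_location='cpu')
print(ckpt.keys())               # expect something like 'state_dict', 'arch', 'epoch', ...
print(len(ckpt['state_dict']))   # number of parameter tensors in the checkpoint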
Hi,
I am wondering what kind of modifications I should make to be able to train on COCO?
Epoch: [34][3590/4999] Time 0.426 ( 1.635) Data 0.000 ( 0.227) Loss 6.8926e+00 (6.9147e+00) Acc@1 73.44 ( 76.76) Acc@5 87.50 ( 87.55)
Epoch: [34][3600/4999] Time 0.437 ( 1.638) Data 0.000 ( 0.227) Loss 7.0694e+00 (6.9147e+00) Acc@1 59.38 ( 76.76) Acc@5 76.56 ( 87.55)
Epoch: [34][3610/4999] Time 0.432 ( 1.638) Data 0.000 ( 0.226) Loss 6.9074e+00 (6.9146e+00) Acc@1 78.12 ( 76.76) Acc@5 90.62 ( 87.55)
Epoch: [34][3620/4999] Time 0.423 ( 1.639) Data 0.000 ( 0.225) Loss 6.9464e+00 (6.9146e+00) Acc@1 71.88 ( 76.76) Acc@5 85.94 ( 87.55)
Epoch: [34][3630/4999] Time 0.436 ( 1.644) Data 0.000 ( 0.225) Loss 6.8364e+00 (6.9145e+00) Acc@1 81.25 ( 76.77) Acc@5 89.06 ( 87.56)
Epoch: [34][3640/4999] Time 0.425 ( 1.646) Data 0.000 ( 0.224) Loss 6.9520e+00 (6.9145e+00) Acc@1 71.88 ( 76.76) Acc@5 85.94 ( 87.56)
Epoch: [34][3650/4999] Time 0.426 ( 1.646) Data 0.000 ( 0.224) Loss 6.8319e+00 (6.9145e+00) Acc@1 84.38 ( 76.77) Acc@5 87.50 ( 87.56)
Epoch: [34][3660/4999] Time 0.428 ( 1.646) Data 0.000 ( 0.223) Loss 6.8066e+00 (6.9144e+00) Acc@1 75.00 ( 76.78) Acc@5 90.62 ( 87.57)
Epoch: [34][3670/4999] Time 0.471 ( 1.651) Data 0.000 ( 0.222) Loss 6.9694e+00 (6.9144e+00) Acc@1 78.12 ( 76.77) Acc@5 89.06 ( 87.57)
Epoch: [34][3680/4999] Time 0.431 ( 1.650) Data 0.000 ( 0.222) Loss 6.8628e+00 (6.9144e+00) Acc@1 81.25 ( 76.77) Acc@5 87.50 ( 87.57)
Epoch: [34][3690/4999] Time 0.428 ( 1.650) Data 0.000 ( 0.221) Loss 6.8666e+00 (6.9145e+00) Acc@1 81.25 ( 76.77) Acc@5 92.19 ( 87.56)
Epoch: [34][3700/4999] Time 0.434 ( 1.650) Data 0.000 ( 0.221) Loss 6.9402e+00 (6.9144e+00) Acc@1 71.88 ( 76.78) Acc@5 87.50 ( 87.57)
Epoch: [34][3710/4999] Time 0.434 ( 1.654) Data 0.000 ( 0.220) Loss 6.8522e+00 (6.9144e+00) Acc@1 81.25 ( 76.78) Acc@5 92.19 ( 87.57)
Epoch: [34][3720/4999] Time 0.421 ( 1.655) Data 0.000 ( 0.219) Loss 6.8393e+00 (6.9145e+00) Acc@1 79.69 ( 76.78) Acc@5 90.62 ( 87.57)
Epoch: [34][3730/4999] Time 0.426 ( 1.658) Data 0.000 ( 0.219) Loss 6.9804e+00 (6.9145e+00) Acc@1 68.75 ( 76.78) Acc@5 81.25 ( 87.57)
Epoch: [34][3740/4999] Time 0.424 ( 1.658) Data 0.000 ( 0.218) Loss 7.0028e+00 (6.9144e+00) Acc@1 75.00 ( 76.78) Acc@5 82.81 ( 87.57)
Epoch: [34][3750/4999] Time 0.438 ( 1.662) Data 0.000 ( 0.218) Loss 6.9528e+00 (6.9144e+00) Acc@1 75.00 ( 76.78) Acc@5 82.81 ( 87.57)
Epoch: [34][3760/4999] Time 0.423 ( 1.664) Data 0.000 ( 0.217) Loss 6.8455e+00 (6.9143e+00) Acc@1 76.56 ( 76.79) Acc@5 93.75 ( 87.57)
Epoch: [34][3770/4999] Time 0.430 ( 1.666) Data 0.000 ( 0.217) Loss 6.9374e+00 (6.9143e+00) Acc@1 81.25 ( 76.79) Acc@5 90.62 ( 87.57)
I use the following command to train on ImageNet with four RTX 2080 Ti GPUs:
python main_moco.py -a resnet50 --mlp --moco-t 0.2 --aug-plus --cos --lr 0.015 --batch-size 256 --dist-url 'tcp://localhost:10001' --multiprocessing-distributed --world-size 1 --rank 0 /job/large_dataset/open_datasets/ImageNet/
I suspect it is training in a supervised manner. Is there anything wrong with my experiment?
I frequently get an error when distributed training is enabled; it occurs roughly every 50-100 epochs. Here is the error message:
terminate called after throwing an instance of 'std::system_error'
what(): Transport endpoint is not connected
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/usr/lib/python3.6/multiprocessing/spawn.py", line 105, in spawn_main
exitcode = _main(fd)
File "/usr/lib/python3.6/multiprocessing/spawn.py", line 115, in _main
self = reduction.pickle.load(from_parent)
_pickle.UnpicklingError: pickle data was truncated
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/usr/lib/python3.6/multiprocessing/spawn.py", line 105, in spawn_main
exitcode = _main(fd)
File "/usr/lib/python3.6/multiprocessing/spawn.py", line 115, in _main
self = reduction.pickle.load(from_parent)
EOFError: Ran out of input
Could you help me to resolve the issue?
What should I do if I want to use only one or two GPUs on a single server?
Hi, would you mind also uploading the weights (or the whole checkpoint) for a model with the linear classifier trained on ImageNet? I'm currently running main_lincls.py myself, but it looks like it will take quite some time to get through the 100 epochs needed, and I guess it would be generally useful to others to have these weights readily downloadable.
When I use a node with 2 GPUs to run this code, I get this problem. Could you help me solve it?
Sincerely
There are some BN layers in the key and query encoders. How should these layers be handled: frozen or not frozen? Has anyone run experiments comparing freezing the BN layers with not freezing them? When I refer to another implementation of MoCo (https://github.com/HobbitLong/CMC/blob/master/train_moco_ins.py, line 412), I notice that it freezes the BN layers of the key encoder but not those of the query encoder, while in the official implementation neither encoder freezes its BN layers. I wonder whether freezing the encoders' BN layers affects the final result, and why?
When I apply MoCo to other datasets, I change the length of the queue and other parameters, and I find it very hard to judge the convergence of the model.
Sometimes the accuracy increases very quickly and the loss approaches 0; under other circumstances, the loss steadily increases while the accuracy remains low. Both situations can still lead to a good feature extractor, because those metrics heavily depend on the queue length. A large queue gives an excess of negative samples, which makes the task extremely unbalanced to learn; a small queue degrades the instance-discrimination problem by making it insufficiently difficult. Problems that are too simple or too hard both hurt the learning process.
How can we judge the convergence of MoCo training? And how can we select a proper queue length depending on our dataset size?
Hi, thanks for the super work!
But ResNet-50 is quite heavy. I wonder whether a RegNet trained with MoCo will be released?
Suppose the parameters of the query encoder satisfy that $\lim_{n\to+\infty}\theta_q$ exists, and suppose the parameters of the key encoder satisfy that $\lim_{n\to+\infty}\theta_k$ exists. Then, following the update rule for the momentum encoder, $\theta_k \leftarrow m\theta_k + (1-m)\theta_q$, we can infer that $\lim_{n\to+\infty}\theta_k = \lim_{n\to+\infty} m\theta_k + \lim_{n\to+\infty}(1-m)\theta_q$, and thus $\lim_{n\to+\infty}\theta_k = \lim_{n\to+\infty}\theta_q$. But how can we prove the convergence of $\theta_k$ in the first place?
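One way I can see to argue it (my own sketch, not from the paper): if $\theta_q$ were held fixed at some value $\theta^\ast$, the momentum update would contract the gap geometrically,
$$\theta_k^{(n)} - \theta^\ast = m\,(\theta_k^{(n-1)} - \theta^\ast) = m^n\,(\theta_k^{(0)} - \theta^\ast) \xrightarrow[n\to+\infty]{} 0 \qquad (0 \le m < 1).$$
More generally, unrolling the update gives $\theta_k^{(n)} = m^n\theta_k^{(0)} + (1-m)\sum_{i=0}^{n-1} m^{\,n-1-i}\,\theta_q^{(i)}$, an exponentially weighted moving average of past $\theta_q$; if $\theta_q^{(n)}$ converges, this average converges to the same limit, so the convergence of $\theta_k$ follows from that of $\theta_q$ rather than having to be assumed separately.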
Ditto. I keep wondering about SyncBN vs. shuffle-BN, specifically whether the former can also effectively prevent cheating.
SimCLR appears to use SyncBN (referred to as "Global BN").
SyncBN works out of the box with PyTorch, whereas shuffling BN requires a bit more hacking. Does the fact that shuffling BN was chosen mean that it is better? (Or was SyncBN simply not ready at the time MoCo was designed?)
Thank you for releasing the code.
In training moco,
optimizer = torch.optim.SGD(model.parameters(), args.lr,
momentum=args.momentum,
weight_decay=args.weight_decay)
Should model.parameters() be just model.encoder_q.parameters()? (Or has the weight_decay been tuned accordingly for the entire model?)
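My reading, sketched with toy modules (stand-ins, not the repo's classes): builder.py copies the query weights into the key encoder and sets requires_grad=False on the key parameters, so even though they are passed to SGD their grads stay None and the optimizer (including its weight decay) never touches them; the key encoder is only updated by the momentum rule.

import torch
import torch.nn as nn

encoder_q = nn.Linear(8, 4)
encoder_k = nn.Linear(8, 4)
for param_q, param_k in zip(encoder_q.parameters(), encoder_k.parameters()):
    param_k.data.copy_(param_q.data)
    param_k.requires_grad = False  # key encoder is updated by momentum, not by gradients

model = nn.ModuleDict({'encoder_q': encoder_q, 'encoder_k': encoder_k})
optimizer = torch.optim.SGD(model.parameters(), lr=0.03, momentum=0.9, weight_decay=1e-4)

loss = model['encoder_q'](torch.randn(2, 8)).sum()
loss.backward()
optimizer.step()  # encoder_k params have grad=None, so SGD leaves them untouched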
Hi @KaimingHe, we had fun reading your paper; thank you for sharing your work.
Fig. 3 in the MoCo v1 paper compares the accuracy of various contrastive learning mechanisms for varying K. We could not find this plot for MoCo v2. Could you please share it for MoCo v2, i.e. the accuracy for those six points K = (256, ..., 65536)?
thanks again,
Srikar
Hi. Thanks for this amazing work.
I have a question, though: is there a specific reason you chose the Jitter, Grayscale, and Gaussian Blur augmentations? Do you know whether stronger augmentations like RandAugment with random magnitude, or AutoAugment, could provide better results? Or would these hurt performance?
Thanks
According to the paper, during training you run for a default of 200 epochs and multiply the learning rate by 0.1 at epochs 120 and 160. During fine-tuning, these numbers become 100, 60, and 80 respectively. For the fine-tuning case, this would imply a learning rate of 30, then 3, then 0.3; however, this is not what the logic of the milestone scheduler performs.
def adjust_learning_rate(optimizer, epoch, args):
    """Decay the learning rate based on schedule"""
    lr = args.lr
    for milestone in args.schedule:
        lr *= 0.1 if epoch >= milestone else 1.
    for param_group in optimizer.param_groups:
        param_group['lr'] = lr
Instead, what happens is that the learning rate is constant at 30 up until epoch 59, and then at every epoch between 60 and 79 it is multiplied by 0.1. Furthermore, at epochs 80 to 100 it is multiplied by 0.1 twice in each epoch cycle (once for epoch >= 60, and again for epoch >= 80). You end up with a final learning rate that is essentially zero. The key is the greater-than-or-equal-to operator, which should be just an equality operator. Correct me if I'm wrong, but shouldn't the logic be:
def adjust_learning_rate(optimizer, epoch, args):
    """Decay the learning rate based on schedule"""
    lr = args.lr
    for milestone in args.schedule:
        lr *= 0.1 if epoch == milestone else 1.
    for param_group in optimizer.param_groups:
        param_group['lr'] = lr
in order to get the kind of stepwise learning-rate scheduling described in the paper?
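To make this easy to check, here is a standalone trace of the function above at a few epochs with the linear-classification defaults (lr 30, schedule [60, 80]); it simply re-runs the loop outside the training script (note that lr is reset from args.lr on every call):

from types import SimpleNamespace

# args here is a stand-in namespace, not the repo's argparse result.
args = SimpleNamespace(lr=30.0, schedule=[60, 80])

def lr_at(epoch):
    lr = args.lr                       # reset from the base lr on every call
    for milestone in args.schedule:
        lr *= 0.1 if epoch >= milestone else 1.
    return lr

for epoch in (0, 59, 60, 79, 80, 99):
    print(epoch, lr_at(epoch))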
Hi,
Thanks for your impressive work.
In moco/builder.py, line 63:
self.queue[:, ptr:ptr + batch_size] = keys.T
I suppose that keys is a Tensor with a batch_size dimension, and that T is a float scalar attribute of self, i.e. self.T.
So an AttributeError: 'Tensor' object has no attribute 'T' is raised when I directly run the training code.
Should it be self.T (I guess)? Or is there a specific setting I missed?
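For what it's worth, my guess is that the T here is the tensor transpose (the Tensor.T property only exists in newer PyTorch releases), not the temperature self.T; on older versions the long-standing .t() method should be an equivalent workaround, e.g.:

import torch

keys = torch.randn(32, 128)      # (batch_size, dim), stand-in for the gathered keys
queue = torch.randn(128, 65536)  # (dim, K) queue buffer
ptr, batch_size = 0, keys.shape[0]
# keys.T is a newer property; keys.t() is the long-available 2-D transpose method
queue[:, ptr:ptr + batch_size] = keys.t()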
Regards,
Hi!
I am trying to incorporate TensorBoard with the following snippet in the train function:
if args.gpu == 0:
    args.tb.add_scalar('loss/train', loss.item(), (len(train_loader) * epoch) + i)
    args.tb.add_scalar('acc1/train', acc1[0], (len(train_loader) * epoch) + i)
But I am receiving a TypeError: can't pickle _thread.lock objects error originating from mp.spawn().
Any way out?
Solved
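(In case others hit the same thing: everything passed through mp.spawn's args must be picklable, and SummaryWriter holds a _thread.lock. A minimal sketch of the usual workaround, which is an assumption on my part rather than necessarily what was done here, is to create the writer inside the spawned worker instead of stashing it on args:)

import torch.multiprocessing as mp
from torch.utils.tensorboard import SummaryWriter

def worker(rank, world_size, log_dir):
    # Create the writer inside the spawned process; it is never pickled this way.
    writer = SummaryWriter(log_dir) if rank == 0 else None
    if writer is not None:
        writer.add_scalar('loss/train', 0.5, 0)
        writer.close()

if __name__ == '__main__':
    mp.spawn(worker, nprocs=2, args=(2, './tb_demo'))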
Hello --
I'm trying to reproduce some of these results on a different dataset, and the loss slowly bounces up and down without converging (see below). Is that expected behavior? I don't think the paper shows what the loss/pretext accuracy looks like during the ImageNet training; might it be possible to share those plots here?
Thanks!
Edit: Note, my dataset has ~250K images, so ~25% of the size of ImageNet. I'm wondering whether the difference in dataset sizes could be causing problems, e.g. because the length of the momentum buffer is 4x larger relative to the size of the dataset.
When I try to run VOC detection training with the command line
python train_net.py --config-file configs/pascal_voc_R_50_C4_24k_moco.yaml --num-gpus 4 MODEL.WEIGHTS ./output.pkl
it runs out of memory on 4 RTX 2080 Ti GPUs (11 GB each).
This seems odd, since the original pascal_voc_R_50_C4_24k_moco in detectron2 only takes about 7.5 GB per GPU. I found that the only difference between them lies in RESNETS.NORM, which is set to FrozenBN in detectron2 but to SyncBN for MoCo.
I tried changing it to FrozenBN and the memory footprint looks good, except that the loss turns to NaN after 30-40 iterations; only after decreasing the lr from 0.2 to 0.05 does training stay stable. I am not sure why SyncBN adds such a large memory expense. Is the training instability caused by removing SyncBN? Any help will be appreciated. Thanks.
Hi,
I am pretty new to DDP and I am wondering what the purpose of the following line is:
Line 268 in 3631be0
Awesome work!
In my opinion, shuffle-BN is proposed to maintain the differences in running mean and variance between encoder q and encoder k, which prevents locally optimal encoder parameters. How do you evaluate the benefits of shuffle-BN?
Moreover, distributed training of MoCo suffers from the time-consuming broadcast and all-gather operations in shuffle-BN. Do you have any suggestions for accelerating distributed training with shuffle-BN?
How can I load MoCo-trained EfficientNet weights into the EfficientNet defined in efficientnet_pytorch? The keys don't match.
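Not sure of the exact cause without seeing the key names, but a sketch of the kind of renaming main_lincls.py does for ResNet, adapted here for efficientnet_pytorch (the checkpoint path is hypothetical, and the 'module.encoder_q.' prefix is an assumption based on the released ResNet checkpoints):

import torch
from efficientnet_pytorch import EfficientNet

model = EfficientNet.from_name('efficientnet-b0')
ckpt = torch.load('moco_efficientnet.pth.tar', map_location='cpu')  # hypothetical path

# Keep only the query-encoder backbone weights and strip the wrapper prefix.
prefix = 'module.encoder_q.'
backbone_sd = {k[len(prefix):]: v for k, v in ckpt['state_dict'].items() if k.startswith(prefix)}
msg = model.load_state_dict(backbone_sd, strict=False)
print(msg.missing_keys, msg.unexpected_keys)  # ideally only the classifier/projection head differ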
Hello, @ppwwyyxx @KaimingHe
I use MoCo to train on the MNIST dataset as an easy example. The MNIST train.py is adapted from the PyTorch MNIST example.
It is easy to reach 99% when training directly in the supervised setting.
When I first pre-train the model with the MoCo method and then fine-tune the pre-trained weights (with the conv weights frozen, so only the fc layer can change), the performance on the test set only reaches 95% and does not improve further.
Concretely, when training on MNIST, I set the queue length to 3840 rather than the default 65536, because the MNIST dataset is much smaller than ImageNet.
Does this mean the feature-extraction network is not trained well? Can you give me some suggestions about this phenomenon?
Also, can you give me some advice on how to train on a custom dataset? What changes to the hyperparameters are required?
I am trying to run the code on my own data (apart from the datasets mentioned in the paper). During training, the accuracy is 100% for the first 10 batches of the first epoch, but it then decreases to 0 or a comparatively very small value for the rest of training, and the loss increases the whole time. Snapshot for reference.
Hi @KaimingHe , @ppwwyyxx
I am unable to reproduce the wall-clock efficiency from the paper. Here's a screenshot of the linear classifier steps/sec:
I observed similar throughput for the pre-training as well. The total pre-training took me around 120 hours to finish, and the linear classifier training is slow as well.
Why are all targets zeros?
Line 155 in 3631be0
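My reading of that line, sketched with toy tensors (an illustration, not the repo's module): the positive logit is concatenated as column 0 of the logits, so the cross-entropy target index is 0 for every sample.

import torch
import torch.nn.functional as F

N, K = 4, 16
l_pos = torch.randn(N, 1)                    # similarity of each query with its own key
l_neg = torch.randn(N, K)                    # similarities with the K queue negatives
logits = torch.cat([l_pos, l_neg], dim=1)    # column 0 is always the positive
labels = torch.zeros(N, dtype=torch.long)    # hence the target class is 0 for every sample
loss = F.cross_entropy(logits / 0.07, labels)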
Hi,
Following the instructions for both 'Unsupervised Training' and 'Linear Classification', I find that different model parameters are initialized in each GPU worker, because the random seed is not set inside the main_worker function.
For PyTorch DistributedDataParallel, do you think initializing the same set of model parameters across all GPU workers could give more accurate gradients and better performance?
Thanks!
Hi, I am curious about the detection downstream task. Must we use 8 GPUs to reproduce the performance in your paper, or is there a way to reproduce the official numbers with 4 or fewer GPUs?
Thanks.
Can you please provide checkpoints from the pre-training step (main_moco.py)? When I use the checkpoints you provided to resume pre-training with main_moco.py, I get errors about missing weights. The checkpoints work fine when I use them with the main_lincls.py script.
Hey, thanks for your contribution to unsupervised CNN learning.
I would like to do some research based on your architecture, but unfortunately I don't have multiple GPUs. Would it be easy to change this architecture to run on a single-GPU system?
The affected methods would be:
concat_all_gather, the forward function, _batch_unshuffle_ddp, _batch_shuffle_ddp
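A rough sketch of the kind of single-GPU fallbacks I have in mind (hypothetical helpers, not from the repo); on one GPU the gather is an identity, and the BN shuffle can be dropped or kept as an in-batch permutation:

import torch

@torch.no_grad()
def concat_all_gather_single_gpu(tensor):
    # With world_size == 1 there is nothing to gather.
    return tensor

@torch.no_grad()
def batch_shuffle_single_gpu(x):
    # Optional in-batch permutation; note that with a single GPU all samples already
    # share BN statistics, so this no longer prevents the leakage shuffle-BN targets.
    idx = torch.randperm(x.shape[0], device=x.device)
    return x[idx], torch.argsort(idx)

@torch.no_grad()
def batch_unshuffle_single_gpu(x, idx_unshuffle):
    return x[idx_unshuffle]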
On top of that, I have a Windows server that doesn't support the distributed module.
Thanks
Hi, I did unsupervised pre-training of a ResNet-50 model on a dataset containing 122,208 unlabeled bird images; the last-epoch log is below:
The loss is stuck at ~6.90, which is similar to another closed issue, #12. In that issue it seemed not that bad. Is this normal?
Then I used this pre-trained model to train and evaluate on a dataset containing 3,959 training images and 2,000 validation images across 200 bird categories. I followed:
'''
python main_lincls.py
-a resnet50
--lr 30.0
--batch-size 256
--pretrained [your checkpoint path]/checkpoint_0199.pth.tar
--dist-url 'tcp://localhost:10001' --multiprocessing-distributed --world-size 1 --rank 0
[your imagenet-folder with train and val folders]
'''
However, the validation accuracy is quite low (~12%), which is much lower than the supervised training method (~60%). I tried several learning rates (0.1, 5, 10, 100.0) but the results still seem bad.
So may I ask how you set these hyperparameters? Or is the pre-trained model bad? How can I check this problem?
Thanks!
Hi, thanks for open-sourcing this excellent repo.
I tried to reproduce the performance of Mask R-CNN (R50-FPN, 1x) following #34 (comment).
But there is still a gap between the reproduced AP (34.8%) and the AP reported in the paper (35.1%).
Are there any differences between our config file and yours?
This is our config file:
_BASE_: "Base-RCNN-FPN.yaml"
MODEL:
  PIXEL_MEAN: [123.675, 116.280, 103.530]
  PIXEL_STD: [58.395, 57.120, 57.375]
  MASK_ON: True
  WEIGHTS: "Mocov1 Model"
  BACKBONE:
    FREEZE_AT: 0
  RESNETS:
    DEPTH: 50
    NORM: "SyncBN"
    STRIDE_IN_1X1: False
  FPN:
    NORM: "SyncBN"
TEST:
  PRECISE_BN:
    ENABLED: True
  EVAL_PERIOD: 5000
SOLVER:
  STEPS: (60000, 80000)
  MAX_ITER: 90000
INPUT:
  FORMAT: "RGB"
OUTPUT_DIR: "./output/mask_fpn_1x_mocov1/"
I am trying to train MoCo v2 on a machine with 2 GPUs using the hyperparameters recommended in this repo. However, the loss gets stuck at around 6.90. Is this behaviour normal, or should I try a different set of hyperparameters? I see that you used a machine with 8 GPUs; could this explain the difference? Thanks!
Hi, thanks for the great work.
I tried to reproduce your results on COCO keypoint detection using the pre-trained MoCo model provided. I strictly followed the training pipeline in moco/detection and used the config in detectron2/configs/COCO-Keypoints/keypoint_rcnn_R_50_FPN_3x.yaml. But training diverged after ~700 iterations as the loss became NaN.
I have tried reducing the base lr, but it does not seem to help much. Also, as I am using IMS_PER_BATCH = 16, I don't feel that a very small base lr is appropriate.
So:
=======
The command I ran is: python moco/detection/train_net.py --config-file configs_keypoints/keypoint_rcnn_R_50_FPN_3x.yaml --num-gpus 2 MODEL.WEIGHTS ./output.pkl
The following is the config file generated after running train_net.py:
CUDNN_BENCHMARK: false
DATALOADER:
  ASPECT_RATIO_GROUPING: true
  FILTER_EMPTY_ANNOTATIONS: true
  NUM_WORKERS: 4
  REPEAT_THRESHOLD: 0.0
  SAMPLER_TRAIN: TrainingSampler
DATASETS:
  PRECOMPUTED_PROPOSAL_TOPK_TEST: 1000
  PRECOMPUTED_PROPOSAL_TOPK_TRAIN: 2000
  PROPOSAL_FILES_TEST: []
  PROPOSAL_FILES_TRAIN: []
  TEST:
  - keypoints_coco_2017_val
  TRAIN:
  - keypoints_coco_2017_train
GLOBAL:
  HACK: 1.0
INPUT:
  CROP:
    ENABLED: false
    SIZE:
    - 0.9
    - 0.9
    TYPE: relative_range
  FORMAT: BGR
  MASK_FORMAT: polygon
  MAX_SIZE_TEST: 1333
  MAX_SIZE_TRAIN: 1333
  MIN_SIZE_TEST: 800
  MIN_SIZE_TRAIN:
  - 640
  - 672
  - 704
  - 736
  - 768
  - 800
  MIN_SIZE_TRAIN_SAMPLING: choice
MODEL:
  ANCHOR_GENERATOR:
    ANGLES:
    - - -90
      - 0
      - 90
    ASPECT_RATIOS:
    - - 0.5
      - 1.0
      - 2.0
    NAME: DefaultAnchorGenerator
    OFFSET: 0.0
    SIZES:
    - - 32
    - - 64
    - - 128
    - - 256
    - - 512
  BACKBONE:
    FREEZE_AT: 0
    NAME: build_resnet_fpn_backbone
  DEVICE: cuda
  FPN:
    FUSE_TYPE: sum
    IN_FEATURES:
    - res2
    - res3
    - res4
    - res5
    NORM: ''
    OUT_CHANNELS: 256
  KEYPOINT_ON: true
  LOAD_PROPOSALS: false
  MASK_ON: false
  META_ARCHITECTURE: GeneralizedRCNN
  PANOPTIC_FPN:
    COMBINE:
      ENABLED: true
      INSTANCES_CONFIDENCE_THRESH: 0.5
      OVERLAP_THRESH: 0.5
      STUFF_AREA_LIMIT: 4096
    INSTANCE_LOSS_WEIGHT: 1.0
  PIXEL_MEAN:
  - 103.53
  - 116.28
  - 123.675
  PIXEL_STD:
  - 1.0
  - 1.0
  - 1.0
  PROPOSAL_GENERATOR:
    MIN_SIZE: 0
    NAME: RPN
  RESNETS:
    DEFORM_MODULATED: false
    DEFORM_NUM_GROUPS: 1
    DEFORM_ON_PER_STAGE:
    - false
    - false
    - false
    - false
    DEPTH: 50
    NORM: SyncBN
    NUM_GROUPS: 1
    OUT_FEATURES:
    - res2
    - res3
    - res4
    - res5
    RES2_OUT_CHANNELS: 256
    RES5_DILATION: 1
    STEM_OUT_CHANNELS: 64
    STRIDE_IN_1X1: true
    WIDTH_PER_GROUP: 64
  RETINANET:
    BBOX_REG_WEIGHTS: &id001
    - 1.0
    - 1.0
    - 1.0
    - 1.0
    FOCAL_LOSS_ALPHA: 0.25
    FOCAL_LOSS_GAMMA: 2.0
    IN_FEATURES:
    - p3
    - p4
    - p5
    - p6
    - p7
    IOU_LABELS:
    - 0
    - -1
    - 1
    IOU_THRESHOLDS:
    - 0.4
    - 0.5
    NMS_THRESH_TEST: 0.5
    NUM_CLASSES: 80
    NUM_CONVS: 4
    PRIOR_PROB: 0.01
    SCORE_THRESH_TEST: 0.05
    SMOOTH_L1_LOSS_BETA: 0.1
    TOPK_CANDIDATES_TEST: 1000
  ROI_BOX_CASCADE_HEAD:
    BBOX_REG_WEIGHTS:
    - - 10.0
      - 10.0
      - 5.0
      - 5.0
    - - 20.0
      - 20.0
      - 10.0
      - 10.0
    - - 30.0
      - 30.0
      - 15.0
      - 15.0
    IOUS:
    - 0.5
    - 0.6
    - 0.7
  ROI_BOX_HEAD:
    BBOX_REG_WEIGHTS:
    - 10.0
    - 10.0
    - 5.0
    - 5.0
    CLS_AGNOSTIC_BBOX_REG: false
    CONV_DIM: 256
    FC_DIM: 1024
    NAME: FastRCNNConvFCHead
    NORM: ''
    NUM_CONV: 0
    NUM_FC: 2
    POOLER_RESOLUTION: 7
    POOLER_SAMPLING_RATIO: 0
    POOLER_TYPE: ROIAlignV2
    SMOOTH_L1_BETA: 0.5
    TRAIN_ON_PRED_BOXES: false
  ROI_HEADS:
    BATCH_SIZE_PER_IMAGE: 512
    IN_FEATURES:
    - p2
    - p3
    - p4
    - p5
    IOU_LABELS:
    - 0
    - 1
    IOU_THRESHOLDS:
    - 0.5
    NAME: StandardROIHeads
    NMS_THRESH_TEST: 0.5
    NUM_CLASSES: 1
    POSITIVE_FRACTION: 0.25
    PROPOSAL_APPEND_GT: true
    SCORE_THRESH_TEST: 0.05
  ROI_KEYPOINT_HEAD:
    CONV_DIMS:
    - 512
    - 512
    - 512
    - 512
    - 512
    - 512
    - 512
    - 512
    LOSS_WEIGHT: 1.0
    MIN_KEYPOINTS_PER_IMAGE: 1
    NAME: KRCNNConvDeconvUpsampleHead
    NORMALIZE_LOSS_BY_VISIBLE_KEYPOINTS: true
    NUM_KEYPOINTS: 17
    POOLER_RESOLUTION: 14
    POOLER_SAMPLING_RATIO: 0
    POOLER_TYPE: ROIAlignV2
  ROI_MASK_HEAD:
    CLS_AGNOSTIC_MASK: false
    CONV_DIM: 256
    NAME: MaskRCNNConvUpsampleHead
    NORM: ''
    NUM_CONV: 4
    POOLER_RESOLUTION: 14
    POOLER_SAMPLING_RATIO: 0
    POOLER_TYPE: ROIAlignV2
  RPN:
    BATCH_SIZE_PER_IMAGE: 256
    BBOX_REG_WEIGHTS: *id001
    BOUNDARY_THRESH: -1
    HEAD_NAME: StandardRPNHead
    IN_FEATURES:
    - p2
    - p3
    - p4
    - p5
    - p6
    IOU_LABELS:
    - 0
    - -1
    - 1
    IOU_THRESHOLDS:
    - 0.3
    - 0.7
    LOSS_WEIGHT: 1.0
    NMS_THRESH: 0.7
    POSITIVE_FRACTION: 0.5
    POST_NMS_TOPK_TEST: 1000
    POST_NMS_TOPK_TRAIN: 1500
    PRE_NMS_TOPK_TEST: 1000
    PRE_NMS_TOPK_TRAIN: 2000
    SMOOTH_L1_BETA: 0.0
  SEM_SEG_HEAD:
    COMMON_STRIDE: 4
    CONVS_DIM: 128
    IGNORE_VALUE: 255
    IN_FEATURES:
    - p2
    - p3
    - p4
    - p5
    LOSS_WEIGHT: 1.0
    NAME: SemSegFPNHead
    NORM: GN
    NUM_CLASSES: 54
  WEIGHTS: ./output.pkl
OUTPUT_DIR: ./output
SEED: -1
SOLVER:
  BASE_LR: 0.02
  BIAS_LR_FACTOR: 1.0
  CHECKPOINT_PERIOD: 5000
  CLIP_GRADIENTS:
    CLIP_TYPE: value
    CLIP_VALUE: 1.0
    ENABLED: false
    NORM_TYPE: 2.0
  GAMMA: 0.1
  IMS_PER_BATCH: 16
  LR_SCHEDULER_NAME: WarmupMultiStepLR
  MAX_ITER: 180000
  MOMENTUM: 0.9
  NESTEROV: false
  STEPS:
  - 120000
  - 160000
  WARMUP_FACTOR: 0.001
  WARMUP_ITERS: 1000
  WARMUP_METHOD: linear
  WEIGHT_DECAY: 0.0001
  WEIGHT_DECAY_BIAS: 0.0001
  WEIGHT_DECAY_NORM: 0.0
TEST:
  AUG:
    ENABLED: false
    FLIP: true
    MAX_SIZE: 4000
    MIN_SIZES:
    - 400
    - 500
    - 600
    - 700
    - 800
    - 900
    - 1000
    - 1100
    - 1200
  DETECTIONS_PER_IMAGE: 100
  EVAL_PERIOD: 0
  EXPECTED_RESULTS: []
  KEYPOINT_OKS_SIGMAS: []
  PRECISE_BN:
    ENABLED: true
    NUM_ITER: 200
VERSION: 2
VIS_PERIOD: 0