
Comments (9)

amaralibey commented on August 26, 2024

Hello @gmberton, thank you for the question,

We did no pre-processing or post-processing; we only resize images to 320x320.

We are using the exact same procedure that the original authors published on their GitHub (see demo: https://github.com/mapillary/mapillary_sls/blob/main/demo.ipynb).

Note that in order to evaluate on the validation set, you need to set SAMPLE_CITIES = "val" and also use mode='val' when instantiating the MSLS object (mode='test' works just fine, but only mode='val' will return the ground truth, i.e. the positives for each query).

There are 750 query images in msls_val, of which they use 740 (3 are panoramas and the others are simply ignored; see https://github.com/mapillary/mapillary_sls/blob/b6c4e43fcd6d5a8a7f40295ee240f8cf0448891d/evaluate.py#L116).

I rewrote the code (https://github.com/amaralibey/MixVPR/blob/main/dataloaders/MapillaryDataset.py) for MSLS to avoid listing all images and regenerating the ground truth each time. I simply saved the qImages, the dbImages and the pIdx (which is the ground truth with a threshold of 25 meters for a correct match) into numpy arrays. This made it way faster for the im2im task (image-to-image recall@k). It also allows us to use the same evaluation function I developed for the GSV-Cities framework (https://github.com/amaralibey/MixVPR/blob/main/utils/validation.py).
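To make that setup concrete, the recall@k computation reduces to something like the sketch below. The file names are hypothetical and the plain-NumPy loop is only an illustration under those assumptions, not the exact MixVPR validation code:

```python
import numpy as np

# Hypothetical file names; the arrays hold the query paths, database paths and
# the per-query ground-truth database indices (25 m threshold) described above.
qImages = np.load("msls_val_qImages.npy")
dbImages = np.load("msls_val_dbImages.npy")
pIdx = np.load("msls_val_pIdx.npy", allow_pickle=True)

def recall_at_k(predictions, ground_truth, k_values=(1, 5, 10, 15, 20, 25)):
    """predictions[i]: database indices retrieved for query i, best match first."""
    hits = np.zeros(len(k_values))
    for q, preds in enumerate(predictions):
        for i, k in enumerate(k_values):
            if np.any(np.isin(preds[:k], ground_truth[q])):
                hits[i:] += 1  # a hit within top-k also counts for every larger k
                break
    return hits / len(predictions)
```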

Finally, I just re-evaluated our model on the demo shared by the authors of MSLS and got the exact same results:

all_recall@1: 0.882
all_recall@2: 0.903
all_recall@3: 0.922
all_recall@4: 0.928
all_recall@5: 0.931
all_recall@10: 0.943
all_recall@15: 0.955
all_recall@20: 0.955
all_recall@25: 0.957
all_map@1: 0.882
all_map@2: 0.824
all_map@3: 0.782
all_map@4: 0.747
all_map@5: 0.709
all_map@10: 0.644
all_map@15: 0.625
all_map@20: 0.619
all_map@25: 0.620

Please let me know if there's something I missed or misunderstood.
Thank you,


gmberton commented on August 26, 2024

Thank you both @amaralibey and @hit2sjtu for your thorough answers!
I was confused because I always followed the procedure from the MSLS paper, where images from Copenhagen and San Francisco are used for validation (excluding the panoramas), so I always used all of their queries (11k). I see, however, that in the code they select a much smaller subset of 740 images (also from Copenhagen and San Francisco) according to the fields of the subtask_index.csv files, so I should use those to reproduce your results :)


gmberton commented on August 26, 2024

@amaralibey Fairness of comparison can be tricky: I'm not sure that training all methods on GSV-Cities is the best way to make a fair comparison, given that some methods might not be suitable for it.
For example CosPlace is designed for training on large-scale dense datasets, and in the paper it is shown to perform poorly when trained on smaller datasets, whereas other methods (like NetVLAD) might be more suitable for smaller datasets.
Also, using the same backbone for all methods is generally a good idea, but in some cases a given backbone can give an advantage to some methods: for example, NetVLAD (and probably MixVPR) performs best when the ResNet backbone is cropped at the second-to-last layer, whereas GeM performs best when the ResNet is cropped at the last layer.
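To make the cropping point concrete, here is a minimal torchvision sketch (not taken from either repository); "cropping" simply means truncating the ResNet after its last or second-to-last residual block:

```python
import torch
import torchvision.models as models

resnet = models.resnet50(weights=None)

# Cropped after the last residual block (layer4): 2048-channel feature maps,
# the configuration the discussion associates with GeM / CosPlace.
crop_last = torch.nn.Sequential(*list(resnet.children())[:-2])   # drop avgpool + fc

# Cropped after the second-to-last block (layer3): 1024-channel feature maps at
# twice the spatial resolution, the configuration associated with NetVLAD/MixVPR.
crop_second_last = torch.nn.Sequential(*list(resnet.children())[:-3])  # also drop layer4

x = torch.randn(1, 3, 320, 320)
print(crop_last(x).shape)         # torch.Size([1, 2048, 10, 10])
print(crop_second_last(x).shape)  # torch.Size([1, 1024, 20, 20])
```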
In my opinion there are very few things that bring benefits to any method (bigger training datasets not being one of them), for example using a bigger batch size or better data augmentation.


gmberton commented on August 26, 2024

I totally agree with you that not using a fixed split between queries and database can help to increase the effective training data (we also didn't use a fixed split for CosPlace), and the mining method that you use clearly improves the results of GeM and NetVLAD, congrats on that!

On the other hand, CosPlace is designed to be trained on a dense, large-scale (>1M images) dataset, so training it on a sparser and smaller dataset (as GSV-Cities is) might not be fair, given that we clearly show this limitation of CosPlace in the paper.

But again, it is very difficult (if not impossible) to design completely fair experiments in VPR, because different methods might benefit from different training choices.
Also, higher resolution does not always mean better results: you can see in this benchmark that lower resolutions can sometimes improve the recall, although in CosPlace we used resolutions similar to those of all previous works.


hit2sjtu commented on August 26, 2024

@gmberton
Here is my reproduction result using the checkpoint "resnet50_MixVPR_4096_channels(1024)_rows(4).ckpt".
For your reference, I used my previous script, which loads the MSLS images, resizes them to 320x320 (cubic sampling) and applies the same normalization as here.
Then I used faiss to perform a nearest-neighbor search to get the top-5 matches, and finally computed recall & mAP with the official MSLS evaluation script. The metrics are computed over the 740 query images, as @amaralibey said.
all_recall@1: 0.877
all_recall@5: 0.931
all_map@1: 0.877
all_map@5: 0.706
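For reference, the retrieval step described above boils down to something like this (descriptor extraction is omitted; the file names are hypothetical and the 4096-D size is only inferred from the checkpoint name):

```python
import faiss
import numpy as np

# Hypothetical descriptor files: (num_queries, D) and (num_database, D) float32
# arrays of global descriptors, e.g. D = 4096 for this checkpoint.
q_desc = np.load("msls_val_query_descriptors.npy").astype("float32")
db_desc = np.load("msls_val_database_descriptors.npy").astype("float32")

index = faiss.IndexFlatL2(db_desc.shape[1])   # exact L2 search over the database
index.add(db_desc)

# Top-5 nearest database images per query, as used for recall@5 / mAP@5 above.
_, top5 = index.search(q_desc, 5)             # top5: (num_queries, 5) indices
```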


amaralibey commented on August 26, 2024

@hit2sjtu, thank you for trying our models :) Your results are correct when using BICUBIC interpolation to resize images to 320x320 (I tried BICUBIC interpolation when testing and got the same results as you). However, since our models were trained on images resized with BILINEAR interpolation, it's preferable to resize the test/validation sets with the same interpolation used during training. In other words, to get the best performance (the numbers I shared), you just need to use BILINEAR interpolation when resizing :)
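For clarity, the evaluation resize would look something like the sketch below; the ImageNet mean/std are an assumption on my part, so use whatever normalization the training pipeline actually applied:

```python
from torchvision import transforms
from torchvision.transforms import InterpolationMode

# Evaluation transform sketch: BILINEAR resize to 320x320 to match training.
# The ImageNet mean/std below are assumed, not taken from the thread.
eval_transform = transforms.Compose([
    transforms.Resize((320, 320), interpolation=InterpolationMode.BILINEAR),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
```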

@gmberton now I understand, thanks for clarifying. Sometimes (and most of the time in VPR), the code doesn't reflect exactly what was presented in the paper. In our comparison in Table 1 and Figure 3, we trained all networks on the same data and used the same script for validation and testing; it was always about fairness of comparison.

Thanks to both of you,
Best regards.


hit2sjtu commented on August 26, 2024

I just tried some simple training on a single 4080 GPU with no parameter tuning. I had to limit the batch size to 100 due to memory constraints. After only 6 hours (40 epochs), the results on MSLS look great though. Really impressive, @amaralibey
all_recall@1: 0.858
all_recall@5: 0.923


amaralibey commented on August 26, 2024

@gmberton In our paper we chose the best configuration for CosPlace (we cropped the ResNet at the last residual block, not the second-to-last, because that's how it was configured in your paper). We used the same image size (320x320) and batch size as for the other techniques.

As for the dataset, I think that GSV-Cities is not a small dataset for VPR, because instead of fixing queries and forming pairs and triplets around them, we approached training as a metric learning problem, where every image in the dataset can serve as a query and as a reference (depending on how it is selected in the online mining step). This makes the number of informative pairs and triplets exponentially higher than when the set of queries is fixed. That's one reason why both GeM (which you showed benefits from bigger datasets) and NetVLAD reach new state-of-the-art performance when trained on GSV-Cities (which is ~80x smaller than SF-XL). I believe that metric learning on GSV-Cities has the potential to bring a lot to VPR.
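Roughly, the online mining idea boils down to something like the following sketch, written with the pytorch-metric-learning library; the specific miner/loss pair and the batch layout (P places x K images) are illustrative assumptions, not necessarily the exact GSV-Cities configuration:

```python
import torch
from pytorch_metric_learning import losses, miners

# Each batch holds P places x K images per place; any image can end up as an
# anchor, a positive, or a negative, depending on what the miner selects online.
miner = miners.MultiSimilarityMiner()
criterion = losses.MultiSimilarityLoss()

descriptors = torch.randn(120, 4096)             # e.g. P=30 places, K=4 images each
labels = torch.arange(30).repeat_interleave(4)   # place id of each descriptor

hard_pairs = miner(descriptors, labels)          # informative pairs mined per batch
loss = criterion(descriptors, labels, hard_pairs)
```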

Finally, when training CosPlace on GSV-Cities, I think we could have achieved better results if we had used images of size 480x640 or 512x512, as in your paper. However, due to the limited resolution of the images in GSV-Cities, we opted for 320x320 and obtained 84.5 recall@1 on MSLS (compared to 85.2 when trained on SF-XL at 512x512 resolution, as in your paper).


wpumain commented on August 26, 2024

What does a fixed split mean?
#5 (comment)

