mihaidusmanu / d2-net

D2-Net: A Trainable CNN for Joint Description and Detection of Local Features

License: Other

Python 8.94% Shell 0.17% MATLAB 0.93% Jupyter Notebook 89.96%

cvpr2019 local-features visual-localization cnn pytorch

d2-net's Issues

Enabling Visual Output

Is there a way to produce at least some visual output, like the figures in the paper?

Or could you share a quick hack that enables some visualization? Since you produced the images in the paper, I guess the module already exists somewhere in the code and is just not used.

P.S.: The work looks great, and we would really like to invest our time in using it. If you can help people in this regard, I believe you will get a lot of citations in the near future :)

Problem extracting multi-scale features

It crashes here when operating on a byte() tensor and a boolean tensor. It works if you convert detections to .byte(). It was working the last time I tried, so could it be a problem with library versions? I'm on torch 1.2.0 now. FYI.

Use of VGG16 pretrained weights

Hi Mihai Dusmanu,

Great work! I enjoyed your paper, and thank you for releasing your code too. While trying to reproduce your model by training it myself, I realized that your code in lib/model.py does not set 'pretrained=True' when calling models.vgg16().
Is this intended or a bug? Are you fine-tuning only the last layer while randomly initializing the weights of the previous layers?

When I manually set it to true, my loss keeps reaching NaN. While debugging, I found that some of the divisions by max values in the score computation actually divide by zero. Have you faced this issue? Could you tell me the best practice for reproducing your learned features?

Thanks,
Paul

Getting "scene_info" file.

Hey, could you please give me a direct download link for the "scene_info" file? I know it is part of MegaDepth, but I have failed to download that dataset despite many attempts.

It would be very nice if you could give the link :)

Thanks!

How to understand the loss function???

Hi @mihaidusmanu, I have read your great paper twice, but I can't figure out how the loss function works. In my understanding, when a correspondence c is not repeatable, the score product Sc(1)*Sc(2) = 0. In the extreme case where none of the corresponding points are repeatable, the loss of the entire image pair is close to zero. If all the corresponding points are repeatable, the weight of each correspondence is 1 / C.
Another question: does off-the-shelf mean that the model is not trained any further, and that the pretrained weights are used directly and still work well?
Thank you very much!
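
The weighting discussed in this question can be sketched numerically. The snippet assumes the form of the loss given in the paper, where the margin term of each correspondence is weighted by the product of its two detection scores, normalized over all correspondences of the pair. The function name, the eps guard, and the sample scores are made up for illustration.

```python
def correspondence_weights(scores1, scores2, eps=1e-12):
    """Weight of each correspondence: s1 * s2, normalized to sum to 1."""
    products = [a * b for a, b in zip(scores1, scores2)]
    total = sum(products)
    # eps is an illustrative guard against an all-zero denominator.
    return [p / (total + eps) for p in products]


# Equal detection scores: every correspondence gets weight 1 / C.
uniform = correspondence_weights([0.5] * 4, [0.5] * 4)

# One poorly detected correspondence contributes almost nothing.
skewed = correspondence_weights([0.9, 0.9, 0.01], [0.9, 0.9, 0.01])
```

Because of the normalization, scaling all scores by a common factor leaves the weights in this sketch essentially unchanged; only the relative scores matter.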

About the localization benchmark

Hi. I saw the results on the localization benchmark for RobotCar and CMU, and I am trying to reproduce them. Here are some questions. 1) For the COLMAP image_registrator, do you use the incremental mapper with bundle adjustment? Did you choose any special parameters? Do you also run an extra global bundle adjustment (bundle_adjuster in COLMAP)? 2) It says that only the rear images are used. Does this mean that only the rear images of the dataset are used as candidates? And why?

Thank you very much.

Request for undistorted reconstructions data

Hi! Your results of D2-Net are very impressive. I tried to retrain your network for better understanding. Could you provide me with the link to the undistorted reconstructions and aggregated scene information folder, please? That would be a great help for me. Thanks in advance.

Invalid Models Links

Hi,

The links for the models appear to be invalid.
Could you confirm this please?

Thank you.

What is the usage of edge detection in 'HardDetectionModule' in model_test.py?

Thanks for sharing such a good implementation!
After studying the code, I was confused by the edge detection part of 'HardDetectionModule' in the file 'model_test.py'. The code goes like this:
    detected = torch.min( is_depth_wise_max, torch.min(is_local_max, is_not_edge) )
Why should edges not be considered as detections? Why can features extracted by VGG16 be used as the input to edge detection? What's the idea behind it?
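
For context, the check resembles the Hessian-ratio test that SIFT uses to discard responses lying along edges, which are poorly localized in one direction. A minimal sketch of such a test for a single 2x2 Hessian follows. The function name and the threshold r = 5 are assumptions for illustration and are not taken from model_test.py.

```python
def is_not_edge(dxx, dyy, dxy, r=5.0):
    """SIFT-style test: keep a point only if its curvature ratio is below r."""
    trace = dxx + dyy
    det = dxx * dyy - dxy * dxy
    if det <= 0:
        # Curvatures of opposite sign (or a flat direction): reject.
        return False
    # trace^2 / det grows as one principal curvature dominates the other.
    return trace * trace / det <= (r + 1.0) ** 2 / r


blob_like = is_not_edge(dxx=2.0, dyy=2.0, dxy=0.0)   # similar curvatures
edge_like = is_not_edge(dxx=10.0, dyy=0.1, dxy=0.0)  # one dominant curvature
```

In this sketch the blob-like response passes and the edge-like one is rejected, since a point on an edge can slide along it without changing the response much.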

Why is the MMA so low when the threshold is lower than 4px?

In Figure 4 of your paper, we can see that the trained D2-Net has a low MMA when the threshold is small. This seems strange, since it even performs worse than the untrained version.

Hello, I want to ask what the path 'phoenix/S6/zl548/MegaDepth_v1' mentioned in undistort_reconstructions.py refers to. I can't find this path in the megadepth_v1_sfm dataset.

Question about the training convergence

Dear Mihai Dusmanu!
Sorry to bother you again! I have two questions about the details of the code.
The first question: I tried to modify the feature extraction network, for example by using depthwise separable convolutions to accelerate the model. However, when I train the new model, after the first epoch the loss stays at 1.000002 and no longer changes. I have tried many modifications and it always stays at 1.000002. Have you ever encountered this problem? I can't figure out the reason.
The second question: I found that there are two different D2_Net definitions, in model.py and model_test.py, used in train.py and extract_features.py respectively. There are some differences between them in the backbone; for example, one uses average pooling and the other uses max pooling. Are these changes in model architecture necessary for the different uses, i.e. training versus feature extraction?
Thanks, Anna!

P.S.: the log file
[train] epoch 1 - batch 11300 / 11600 - avg_loss: 1.054927
[train] epoch 1 - batch 11400 / 11600 - avg_loss: 1.054472
[train] epoch 1 - batch 11500 / 11600 - avg_loss: 1.054008
[train] epoch 1 - avg_loss: 1.053536
[valid] epoch 1 - batch 0 / 1875 - avg_loss: 1.000002
[valid] epoch 1 - batch 100 / 1875 - avg_loss: 1.000002
[valid] epoch 1 - batch 200 / 1875 - avg_loss: 1.000002
[valid] epoch 1 - batch 300 / 1875 - avg_loss: 1.000002
[valid] epoch 1 - batch 400 / 1875 - avg_loss: 1.000002
[valid] epoch 1 - batch 500 / 1875 - avg_loss: 1.000002
[valid] epoch 1 - batch 600 / 1875 - avg_loss: 1.000002
[valid] epoch 1 - batch 700 / 1875 - avg_loss: 1.000002
[valid] epoch 1 - batch 800 / 1875 - avg_loss: 1.000002
[valid] epoch 1 - batch 900 / 1875 - avg_loss: 1.000002
[valid] epoch 1 - batch 1000 / 1875 - avg_loss: 1.000002
[valid] epoch 1 - batch 1100 / 1875 - avg_loss: 1.000002
[valid] epoch 1 - batch 1200 / 1875 - avg_loss: 1.000002
[valid] epoch 1 - batch 1400 / 1875 - avg_loss: 1.000002
[valid] epoch 1 - batch 1500 / 1875 - avg_loss: 1.000002
[valid] epoch 1 - batch 1600 / 1875 - avg_loss: 1.000002
[valid] epoch 1 - batch 1700 / 1875 - avg_loss: 1.000002
[valid] epoch 1 - batch 1800 / 1875 - avg_loss: 1.000002
[valid] epoch 1 - avg_loss: 1.000002
[train] epoch 2 - batch 0 / 11600 - avg_loss: 1.000002
[train] epoch 2 - batch 100 / 11600 - avg_loss: 1.000002
[train] epoch 2 - batch 200 / 11600 - avg_loss: 1.000002
[train] epoch 2 - batch 400 / 11600 - avg_loss: 1.000002
[train] epoch 2 - batch 500 / 11600 - avg_loss: 1.000002
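
A loss pinned at almost exactly 1 may indicate that the descriptors have collapsed to a constant, in which case a margin ranking loss returns its margin. The sketch below assumes a hinge of the form max(0, M + p - n) with M = 1 and squared distances between L2-normalized descriptors. The names are illustrative, and the real loss.py may differ in detail.

```python
def squared_distance(a, b):
    """Squared L2 distance for unit vectors, written as 2 - 2 * <a, b>."""
    return 2.0 - 2.0 * sum(x * y for x, y in zip(a, b))


def margin_loss(anchor, positive, negative, margin=1.0):
    """Hinge on (positive distance - negative distance) with a margin."""
    p = squared_distance(anchor, positive)
    n = squared_distance(anchor, negative)
    return max(0.0, margin + p - n)


# Collapsed network: every location produces the same descriptor.
same = [1.0, 0.0, 0.0]
collapsed = margin_loss(same, same, same)

# Healthy case: the positive is close and the negative is far.
healthy = margin_loss([1.0, 0.0], [1.0, 0.0], [0.0, 1.0])
```

If that is what happens here, checking whether the descriptors of different pixels are still distinct after the first epoch would be a quick diagnostic.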

Is depth map really necessary?

Hi, when you preprocess MegaDepth for training, why don't you just use the 2D points projected from the matched 3D points as keypoints? Using the depth map, as you did in the paper, generates more keypoints, but also includes less reliable ones, such as tree foliage. What's your opinion? Thanks a lot.

How can I achieve acceleration in the feature matching phase?

The feature extraction phase can be based on GPU acceleration. How can I achieve acceleration in the feature matching phase?
Thank you!

The difference between SoftDetectionModule and HardDetectionModule

The training phase in the open-source code uses the SoftDetectionModule in model.py, while testing uses the HardDetectionModule in model_test.py. What is the difference?
Thanks.

Question about HPatches_Sequences_Matching_Benchmark evaluation.

Dear Mihai Dusmanu!
I am sorry to disturb you again. Can you provide the third-party models you used in your paper, such as hesaff, hesaffnet, delf, delf-new, and superpoint? Uploading these models would keep the comparison objective, since these methods are constantly being updated. It would also help us study your work and that of others.
In addition, can you provide the code to extract features and descriptors with these third-party models? I ask because fewer than 2000 features are extracted in a SLAM (simultaneous localization and mapping) system; usually there are fewer than 1000.
Thank you, sincerely.

Preprocessed data

Although I already have another issue open, I am writing a new one because it is on a completely different topic.

In the scene info generation step of the preprocessing pipeline, I see that you use the images obtained from the undistortion. Why don't you use the images in MegaDepth? Since they match their corresponding depth maps, they should also be undistorted. The number of images in the dataset is smaller than the number of all undistorted images, as the dataset only contains those that have a depth map. Is there any special reason for this?

Thanks!

About the evaluation in visuallocalizationbenchmark

Hello again. I am trying to reproduce the evaluation part of d2-net.
I managed to extract the features and save them into npz files.
I simply used 'extract_features.py' to extract the features.
However, after importing the features and the matches,
the pipeline fails in the geometric reconstruction step,
which also causes the reconstruction to fail.

Can you tell me what I am doing wrong? Thank you.

==============================================================================
Triangulating image #1112
==============================================================================

  => Image has 0 / 0 points
  => Triangulated 0 points

==============================================================================
Triangulating image #29
==============================================================================

  => Image has 0 / 0 points
  => Triangulated 0 points

==============================================================================
Triangulating image #3542
==============================================================================

  => Image has 0 / 0 points
  => Triangulated 0 points

==============================================================================

Setting --preprocessing image for training and testing phase

Thanks for this great work! It has all been working well so far.
At the moment, I am trying to fine-tune the weights of the d2-net model downloaded from https://github.com/mihaidusmanu/d2-net#downloading-the-models.

I wonder which image preprocessing option (caffe / torch) I should use for training and then for testing?

I am confused because it seems that the models provided at https://github.com/mihaidusmanu/d2-net#downloading-the-models were originally trained using "Caffe".
So, from my understanding, if I were to use the provided model directly, I should set the image preprocessing to "Caffe". Is that right?

However, when it comes to fine-tuning the model:
I fine-tuned it in PyTorch, but does this mean that I should use "torch" (rather than caffe) in my training?
Then, as I also want to test this model, I have to set the preprocessing to torch. Is this right?
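
For context, the two options usually differ in channel order and normalization. The sketch below shows the conventional Caffe-style and torchvision-style transforms for a single RGB pixel. The constants are the widely used ImageNet statistics; the exact behavior of the repository's preprocessing code is assumed rather than verified here, and the function name is hypothetical.

```python
CAFFE_BGR_MEAN = (103.939, 116.779, 123.68)
TORCH_RGB_MEAN = (0.485, 0.456, 0.406)
TORCH_RGB_STD = (0.229, 0.224, 0.225)


def preprocess_pixel(rgb, mode):
    """Normalize one RGB pixel (values in 0..255) in Caffe or torch style."""
    r, g, b = (float(v) for v in rgb)
    if mode == "caffe":
        # RGB -> BGR, then subtract the per-channel mean; range stays ~0..255.
        bgr = (b, g, r)
        return tuple(v - m for v, m in zip(bgr, CAFFE_BGR_MEAN))
    if mode == "torch":
        # Scale to 0..1, then standardize per channel in RGB order.
        scaled = (r / 255.0, g / 255.0, b / 255.0)
        return tuple((v - m) / s
                     for v, m, s in zip(scaled, TORCH_RGB_MEAN, TORCH_RGB_STD))
    raise ValueError(f"unknown preprocessing mode: {mode}")


caffe_px = preprocess_pixel((255, 0, 0), "caffe")
torch_px = preprocess_pixel((255, 0, 0), "torch")
```

The two outputs differ by roughly two orders of magnitude, which is why weights trained under one convention are generally evaluated under the same one.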

Bug in HPatches evaluation notebook

@mihaidusmanu : thanks for the notebook to reproduce results from your paper. It seems like the names of DELF and "hesaffnet" got switched in your plot?

methods = ['hesaff', 'hesaffnet', 'delf', 'superpoint', 'lf-net', 'd2-net', 'd2-net-ms', 'd2-net-trained', 'd2-net-trained-ms']
names = ['Hes. Aff. + Root-SIFT', 'DELF', 'HAN + HN++', 'SuperPoint', 'LF-Net', 'D2-Net', 'D2-Net MS', 'D2-Net Trained', 'D2-Net Trained MS']

Could you please fix that in your notebook when you get a chance? Thanks
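
If the two labels are indeed swapped, the corrected ordering would look like the sketch below. The dictionary is an illustrative addition, not part of the notebook.

```python
methods = ['hesaff', 'hesaffnet', 'delf', 'superpoint', 'lf-net',
           'd2-net', 'd2-net-ms', 'd2-net-trained', 'd2-net-trained-ms']
# 'HAN + HN++' and 'DELF' swapped relative to the snippet quoted above.
names = ['Hes. Aff. + Root-SIFT', 'HAN + HN++', 'DELF', 'SuperPoint', 'LF-Net',
         'D2-Net', 'D2-Net MS', 'D2-Net Trained', 'D2-Net Trained MS']

# Keying labels by method makes a mismatch easier to spot than parallel lists.
label_of = dict(zip(methods, names))
```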

Download problem (Undistorted_sfm)

Hi mihaidusmanu,
When I download Undistorted_SfM from Google Drive, only some of the files are automatically compressed for download, and I then need to download the remaining *.tar.gz files one by one. How can I download everything at once?

Data download error

Thank you for your good work. I have tried to download the MegaDepth v1 dataset many times, but the archive is very big (tar.gz, 199 GB). Is there another address from which this data can be downloaded part by part? Thank you.

Smaller Subset Data Available?

I am trying to run through your training code, because I would like to inspect the structure of the training data, especially how correspondence points are represented and stored.

Due to limited disk space, I downloaded a subset of the data (selected subfolders) from your Google Drive, but I have difficulty downloading the whole MegaDepth v1 dataset. It is too big.

Do you have a small subset of the MegaDepth data available? If so, would you mind sharing it with us for fast experimentation? Thank you!

Whether to use ReLU for the last layer

Hi, In the supplementary material of your paper, I found the following statement:

We noticed that ReLU has a significant negative impact on the off-the-shelf descriptors, but not on the fine-tuned ones. Thus, we report results without ReLU for the off-the-shelf model and with ReLU for the fine-tuned one.

So does this mean that for the off-the-shelf model you use the features before ReLU to calculate the detection score, while for the fine-tuned model you use the features after ReLU to calculate the score and train the model? Is my understanding right?

And could you please offer some insight into why this ReLU layer matters? I think that if no ReLU is used, the score might be less than 0 (if a pixel has a feature vector whose elements are all less than zero).

Thanks.

Keypoint Scores

In an attempt to extract the N best keypoints from a set of results, I sorted them in descending order using: keypoints = np.asarray([row for _, row in sorted(zip(scores, keypoints), key=lambda pair: pair[0])]).squeeze() after the results are obtained. I then cut this array, keeping only the top N keypoints: keypoints = keypoints[:N,:]. The results are not what I expected. I am assuming that higher scores are better. Below is an example of the highest 300 keypoints.

I am surprised to see so many keypoints in the texture-poor section on the right. I have used model_ots.pth and model_tf.pth and both give similar results. Interestingly, if I sort in ascending order I get more reasonable results:

In this case we see the 300 points with the lowest scores. All points are within the texture-rich areas. Is this something you would expect, or what is your intuition in this case?

Thanks,
Patrick
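
One thing worth checking: Python's sorted() orders ascending by default, so slicing the first N rows of its result keeps the N lowest scores rather than the highest. A minimal sketch of a descending selection follows, using plain lists and made-up values; the helper name is hypothetical.

```python
def top_n_keypoints(keypoints, scores, n):
    """Return the n keypoints with the highest scores, best first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [keypoints[i] for i in order[:n]]


# Illustrative data: (x, y) positions with one score each.
keypoints = [(10, 12), (40, 8), (25, 30), (5, 5)]
scores = [0.2, 0.9, 0.5, 0.1]

best_two = top_n_keypoints(keypoints, scores, 2)
```

With NumPy arrays the same selection is commonly written as keypoints[np.argsort(-scores)[:N]].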

the positive distance is consistent with the paper L2 distance

Hi mihaidusmanu,
In the paper, p(c) = ||dA - dB||2. In the code (loss.py, line 86):
    positive_distance = 2 - 2 * (
        descriptors1.t().unsqueeze(1) @ descriptors2.t().unsqueeze(2)
    ).squeeze()
They are not equal.
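
The two expressions are related once the descriptors are L2-normalized: ||a - b||^2 = ||a||^2 + ||b||^2 - 2<a, b> = 2 - 2<a, b>, so the quantity 2 - 2<a, b> is the squared L2 distance rather than the distance itself. A quick numerical check, assuming unit-norm descriptors (the vectors here are random stand-ins, not real D2-Net output):

```python
import math
import random


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


random.seed(0)
d_a = normalize([random.gauss(0, 1) for _ in range(512)])
d_b = normalize([random.gauss(0, 1) for _ in range(512)])

squared_l2 = sum((x - y) ** 2 for x, y in zip(d_a, d_b))
from_dot = 2.0 - 2.0 * sum(x * y for x, y in zip(d_a, d_b))
```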

Inliers for Less textured Images

Hi,

I am planning to use this architecture to find feature points, and subsequently inliers after matching, for pairs of images that contain little texture.

The pre-trained model is returning zero inliers.

Does this mean that I have to train on such images, or do you have a better suggestion?

Thank you.

D2-Net feature matching

Hi, can you show me a demo of D2-Net feature matching? I want to compare it with other features and demonstrate the robustness of D2-Net under varying conditions.
I have another question. If I want to compare your method with other methods, such as DELF, LF-Net, SuperPoint and so on, how should I evaluate their performance in changing environments and plot the resulting curve? What are the definitions of the X axis and the Y axis?
Looking forward to your reply, thank you.

How to achieve matching acceleration

1. In the example "/d2-net/qualitative/Qualitative-Matches.ipynb", matching the two sets of extracted features takes 1 minute. How can the matching be accelerated?

2. What is the difference between "d2_ots.pth", "d2_tf.pth", and "d2_tf_no_phototourism.pth"?
Thank you!

Test Network Arch.

Hi,

Thanks for publishing your work, it is very interesting to review.

One question I have is about the test-time network configuration. In Section 4.3 of the paper you mention a number of changes to the network for testing. Specifically:

1. pool3 is replaced with an average pooling layer with stride 1.
2. conv4_1 to conv4_3 are replaced with dilated convolutions with a rate of 2.

In your code I was following the extract_features.py script, which uses the D2Net model, and as far as I understand this is configured differently, i.e. it uses the truncated VGG16 net for the dense portion. It would be great if you could clarify this for me.

Kind Regards,
Patrick

MegaDepth

Hi, I have already downloaded the MegaDepth dataset. Do I still need to download the SfM models? They are too big. Is there a good alternative to downloading them? Thanks

bash preprocess_megadepth.sh runtime warning

I am trying to preprocess the megadepth dataset. However, by running the bash script I get a lot of RuntimeWarnings. Is this normal?

0004
0005
0007
0008
preprocess_scene.py:158: RuntimeWarning: invalid value encountered in arccos
  angles = np.arccos(np.dot(principal_axis, np.transpose(principal_axis))) / np.pi * 180
preprocess_scene.py:158: RuntimeWarning: invalid value encountered in arccos
  angles = np.arccos(np.dot(principal_axis, np.transpose(principal_axis))) / np.pi * 180
preprocess_scene.py:158: RuntimeWarning: invalid value encountered in arccos
  angles = np.arccos(np.dot(principal_axis, np.transpose(principal_axis))) / np.pi * 180
0011
preprocess_scene.py:158: RuntimeWarning: invalid value encountered in arccos
  angles = np.arccos(np.dot(principal_axis, np.transpose(principal_axis))) / np.pi * 180
0012

Thanks
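
This warning typically appears when an argument to arccos falls slightly outside [-1, 1], which rounding can cause for dot products of unit vectors (a NaN input triggers it too). Whether that is the cause here is an assumption. A minimal sketch of the usual clamp, in plain Python with a hypothetical helper name:

```python
import math


def safe_angle_deg(cos_value):
    """Angle in degrees from a cosine, tolerating small rounding overshoot."""
    clamped = max(-1.0, min(1.0, cos_value))
    return math.degrees(math.acos(clamped))


# A dot product of two identical unit vectors can round to just above 1.
overshoot = 1.0000000000000002
angle = safe_angle_deg(overshoot)
```

With NumPy, applying np.clip(x, -1.0, 1.0) before np.arccos serves the same purpose.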

Could D2-Net learn features that are rotation invariant?

Thanks for the great work! I would like to train d2-net for a problem that involves image pairs related by a rotation.

(Additionally, the images are actually karyotype images. Each image pair contains a large raw karyotype image with a lot of overlapping chromosomes, and a small image with one perfectly cropped chromosome instance, i.e. with its background removed. Also, the two images have the same scale.)

Do you have any insight if D2-Net is a good solution to this? Thank you!

About the speed in feature extraction process?

Thank you so much for your excellent work.

Multi-scale feature extraction requires 12 GB of memory and takes 4-5 seconds on an M40.
With 18,000 feature vectors, matching takes 20 s on the GPU.
How can I reduce the memory usage and speed up feature extraction and matching? Thank you!

what is the difference between "d2_ots.pth","d2_tf.pth",and "d2_tf_no_phototourism.pth"?

Dear Mihai, first of all, thanks for your great work.
1. What is the difference between "d2_ots.pth", "d2_tf.pth", and "d2_tf_no_phototourism.pth"?
2. Can you open-source a D2-Net based on the ResNet model?

Thank you!

Question about calculating the local max score.

Hi, Thanks for your sharing. In your paper, the definition of soft local max score is

But in your implementation I found

    def forward(self, batch):
        b = batch.size(0)

        batch = F.relu(batch)

        max_per_sample = torch.max(batch.view(b, -1), dim=1)[0]
        exp = torch.exp(batch / max_per_sample.view(b, 1, 1, 1))
        sum_exp = (
            self.soft_local_max_size ** 2 *
            F.avg_pool2d(
                F.pad(exp, [self.pad] * 4, mode='constant', value=1.),
                self.soft_local_max_size, stride=1
            )
        )
        local_max_score = exp / sum_exp

which means that you first normalize the feature map by dividing by the maximum over the whole image. But it is not clear to me why this normalization is needed, and why you do not apply the same kind of normalization when calculating the channel selection score.

Performance of trained model weights

Hi,
thanks for making your code public.
I was wondering whether there are performance statistics for the trained model weights you offer. They would be helpful for deciding which weights to use in experiments.

About the speed in feature extraction process

Hello, great paper.
I just want to ask about the feature extraction speed.
I am trying to extract d2-net features on the Aachen Day-Night dataset.
However, it takes around 20~30 seconds per image, even with the multi-scale option off.
Is it normal to observe this?
After tracing the code, it seems the problem could be caused by the feature interpolation.
Do you have any comments on this? Thanks.

Question about preparing the training data.

Hi, Thanks for your sharing. I have a question about your paper. In 4.3 part Implementation details, you mentioned that

For each pair, we selected a random 256 × 256 crop centered around one correspondence. We use a batch size of 1 and make sure that the training pairs present at least 128 correspondences in order to obtain meaningful gradients.

In my understanding, your model needs pairs of images and pixel-wise correspondences as input. You then densely extract the feature vector for each pixel and calculate the loss based on the correspondence information. Why do you need to crop the images instead of using the original images? Is it because of memory constraints? And what do you mean by a random 256 × 256 crop centered around one correspondence? Around which correspondence?

After that, do you choose a fixed number of pixel-wise correspondences as your positive pairs, or do you use all of them? And how can you guarantee that the two cropped images have at least 128 corresponding pixel pairs?

Thanks a lot !
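
As an illustration of what "a crop centered around one correspondence" could mean, the sketch below picks one correspondence at random and clamps a 256 × 256 window around it to the image bounds. It is one possible reading of the paper's description, not the repository's data loader, and all names are hypothetical.

```python
import random


def crop_around(point, image_size, crop=256):
    """Top-left corner of a crop x crop window centered on point, kept in bounds."""
    x, y = point
    width, height = image_size
    left = min(max(x - crop // 2, 0), max(width - crop, 0))
    top = min(max(y - crop // 2, 0), max(height - crop, 0))
    return left, top


def count_inside(points, corner, crop=256):
    """Number of points that fall inside the window starting at corner."""
    left, top = corner
    return sum(left <= x < left + crop and top <= y < top + crop
               for x, y in points)


random.seed(0)
points = [(random.randrange(640), random.randrange(480)) for _ in range(500)]
center = random.choice(points)
corner = crop_around(center, image_size=(640, 480))
kept = count_inside(points, corner)
```

Under this reading, a pair whose crops retain fewer correspondences than required (128 in the paper) could simply be skipped.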

Comparing features with same dimensionality

Thanks for this very interesting paper, and congrats on the CVPR acceptance!

I have a follow-up question: I believe in your experiments the different compared feature descriptors have different dimensionalities. For example, DELF's default is 40D, while the D2-Net descriptor is 512D (and 128D for most of the other approaches), so this can have a large impact in the memory stored by the system. Did you perform experiments by using the same dimensionality for different descriptors? Since the performance of different techniques are quite close in the different experiments, I am wondering if this could be playing an important role here.

One quick way to try this out for DELF features is to tune the dimensionality, which can be done by simply changing the pca_dim parameter in the DELF config: https://github.com/tensorflow/models/blob/master/research/delf/delf/python/examples/delf_config_example.pbtxt#L22

Tensorflow version

Hi, thanks for sharing. Could you please share a TensorFlow version of D2-Net?

Using MobileNet instead of VGG16

Dear Mihai, first of all, thanks for your great work. I am currently trying to make D2 faster, and I read your suggestion to use MobileNet in a different GitHub issue. My application needs a certain precision in feature position. The current truncated VGG16 system scales down the input image resolution by a factor of ~4 (e.g. input 640x640 => dense_features = 159x159), which is the minimum I can afford. Translating this spatial downscaling to MobileNetV2, I would have to truncate its structure pretty early, after layer 3 (the 2nd bottleneck layer). Do you have any insight into whether that would make sense? Or could it be that MobileNet(V2) is not a good candidate for a feature-precise D2 version, due to its early, drastic spatial reduction?

trained model src

Hi,
Thanks for the excellent work!
I'm having trouble finding out how you trained the following:
models/d2_ots.pth
models/d2_tf.pth
Pardon me if I have missed this information in the paper or in the GitHub details.
regards
QF

Unstable training on subset of megadepth

Hi,

I was wondering if you had insight into why I run into NaNs while training on a subset (from 10-50 scenes) of Megadepth data - specifically, training as suggested in the readme on the 113 ish MD scenes works as expected. However, when the only change is training set size, I get 0s appearing in depth_wise_max in model.py, causing the score to get a div0 NaN. Is this a symptom of overfitting or something like that?

Also, are weights randomly initialized? In this repo, even in the extractor module, when you pull down the vgg head, it seems there's no pretrained=True flag.

An error occurred when extracting multiscale features

When I try to extract multiscale features, an error occurred like this:

Traceback (most recent call last):
  File "extract_features.py", line 120, in
    model
  File "/data1/stx/D2-net/d2-net-master/lib/pyramid.py", line 46, in process_multiscale
    detections = torch.min(detections, (1 - banned))
RuntimeError: Expected object of scalar type Bool but got scalar type Byte for argument #2 'other' in call to _th_min

My python version is 3.6.6, pytorch version is 1.3.0

Overlap matrix resulting in many -1.

In megadepth_utils/preprocess_scene.py line 209, the inner loop's scope is defined as for idx2 in range(idx1 + 1, n_images):.
This results in the lower triangle of the overlap matrix being all -1, hence drastically reducing the number of positive pairs. Could you confirm this is the case? Thanks
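
Whether this matters depends on the overlap measure. If it is symmetric, computing only idx2 > idx1 and mirroring the value is enough; if it is directional (for example, the fraction of one image's points seen in the other), both orders need to be computed. The sketch below shows the mirroring variant with a hypothetical overlap function. It is not the code from preprocess_scene.py.

```python
def build_overlap_matrix(n_images, overlap_fn):
    """Fill an n x n matrix from a symmetric pairwise function; -1 means unset."""
    matrix = [[-1.0] * n_images for _ in range(n_images)]
    for idx1 in range(n_images):
        for idx2 in range(idx1 + 1, n_images):
            value = overlap_fn(idx1, idx2)
            matrix[idx1][idx2] = value
            matrix[idx2][idx1] = value  # mirror into the lower triangle
    return matrix


# Hypothetical symmetric overlap: closer indices overlap more.
overlap = build_overlap_matrix(4, lambda i, j: 1.0 / (1 + abs(i - j)))
```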

How does D2-Net implement 3D reconstruction

How does D2-Net perform 3D reconstruction on the COLMAP platform? How is the .d2-net data format linked to COLMAP?
Thank you

About Hpatches

Hello. I just want to ask about the HPatches evaluation. In the file there are

n_i=52
n_v=56

But I count the dataset and it seems like

n_i=57
n_v=59

Is there something I missed? Thanks

HPatches evaluation tool

@mihaidusmanu Very nice work. Thanks for sharing your implementation!
Would it be possible for you to include an implementation for HPatches evaluation in this repo?

Thanks,

-Vimal

Question about the training result

Dear Mihai Dusmanu!
Thank you for your answer last time! I have run the code you provided. After training for the default 10 epochs, I got the model saved in 'checkpoints/d2.best.pth' and used it in place of the downloaded model 'models/d2_tf.pth' in extract_features.py. However, when I test the training result in HPatches-Sequences-Matching-Benchmark.ipynb, the numbers of features and matches I compute with the model in d2.best.pth are lower than those in the precomputed d2-net-trained errors. When I run Qualitative-Matches.ipynb to see the result, I find that the plot is also not as good as with the old model, d2_tf.pth. I am wondering whether there is any step I missed. How can I train a model that matches the results of d2_tf.pth?
Thanks Sincerely!