mihaidusmanu / d2-net
D2-Net: A Trainable CNN for Joint Description and Detection of Local Features
License: Other
Is there a way to produce at least some visual output like the figures in the paper?
Or could you suggest a quick hack that would enable some visualization? Since you produced the images in the paper, I guess the code already contains such a module and it is just not used.
P.S. The work looks great, and we would really like to invest our time in using it. If you can help people in this regard, I believe you will get a lot of citations in the near future :)
It crashes here, when operating on a byte() tensor and a boolean tensor. It works if you convert detections to .byte(). It was working the last time I tried, so could it be a problem with library versions? I'm on torch 1.2.0 now. FYI.
Hi Mihai Dusmanu,
Great work! I enjoyed your paper, and thank you for releasing your code too. While trying to reproduce your model by training it myself, I noticed that your code in lib/model.py does not set 'pretrained=True' when calling models.vgg16().
Is this intended or a bug? Are you fine-tuning only the last layer while randomly initializing the weights of the previous layers?
When I manually set it to True, my loss keeps reaching NaN. While debugging, I found that some of the divisions by max values in the score computation end up dividing by zero. Have you faced this issue? Could you tell me the best practice for reproducing your learned features?
Thanks,
Paul
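The division-by-zero path described above can be reproduced in isolation. This is a minimal sketch, assuming the score computation divides a non-negative feature map by its maximum; `divide_by_max`, `safe_divide_by_max` and the `eps` floor are hypothetical names and values, not part of the repository:

```python
import numpy as np

def divide_by_max(x):
    """Unguarded version: yields NaN everywhere when x is all zeros (0 / 0)."""
    with np.errstate(invalid="ignore", divide="ignore"):
        return x / x.max()

def safe_divide_by_max(x, eps=1e-8):
    """Guarded version: floors the denominator at eps (an assumed constant)."""
    return x / np.maximum(x.max(), eps)

dead = np.zeros((2, 3))  # e.g. a feature map that is entirely zero after a ReLU
print(np.isnan(divide_by_max(dead)).all())       # True
print(np.isnan(safe_divide_by_max(dead)).any())  # False
```

Whether a floor like this is appropriate for training, or whether the all-zero maps point to a different problem upstream, is a question for the author.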
Hey, could you please give me a direct download link for the "scene_info" files? I know they are part of MegaDepth, but I have failed to download that dataset despite many attempts.
It would be very nice if you could give the link :)
Thanks!
Hi @mihaidusmanu, I have read your great paper twice, but I can't figure out how the loss function works. In my understanding, when a correspondence c is not repeatable, the score Sc(1)*Sc(2) = 0. In the extreme case where none of the corresponding points are repeatable, the loss of the entire image pair is close to zero. If all the corresponding points are repeated well, the weight of each corresponding point is 1 / C.
Another question: does off-the-shelf mean that the model is not trained any further, and that the pretrained weights are used directly and still work well?
Thank you very much!!!
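The weighting discussed in this question can be checked numerically. This is a minimal sketch, assuming the loss is a sum of per-correspondence margin terms weighted by the product of the two detection scores and normalised over all correspondences, which is consistent with the 1 / C case above. `weighted_loss` and the `eps` guard are illustrative, not the repository's code:

```python
import numpy as np

def weighted_loss(scores1, scores2, margins, eps=1e-12):
    """Score-weighted average of per-correspondence margin terms."""
    products = scores1 * scores2                  # one product per correspondence
    weights = products / (products.sum() + eps)   # normalise over correspondences
    return float((weights * margins).sum()), weights

margins = np.array([1.0, 2.0, 3.0, 4.0])

# Equal scores: every correspondence gets weight 1 / C.
loss, weights = weighted_loss(np.ones(4), np.ones(4), margins)
print(weights.tolist(), loss)

# One correspondence with a near-zero score barely contributes.
loss_low, weights_low = weighted_loss(np.array([1.0, 1.0, 1.0, 1e-6]), np.ones(4), margins)
print(weights_low.round(3).tolist())
```

Under this reading only the relative scores matter, because the weights are renormalised for each image pair.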
Hi. I saw the results on the localization benchmark for RobotCar and CMU and am trying to reproduce them. Here are some questions. 1) For the COLMAP image_registrator, do you use incremental mapper such as bundle adjustment? Did you choose any special parameters? Do you also run an extra global bundle adjustment (bundle_adjuster in COLMAP)? 2) It says that only the rear images are used. Does this mean that only the dataset's rear images are used as candidates? And why?
Thank you very much.
Hi! Your results of D2-Net are very impressive. I tried to retrain your network for better understanding. Could you provide me with the link to the undistorted reconstructions and aggregated scene information folder, please? That would be a great help for me. Thanks in advance.
Hi,
The links for the models appear to be invalid.
Could you confirm this please?
Thank you.
Thanks for sharing such a good implementation!
After studying the code, I am confused about the edge detection part of 'HardDetectionModule' in the file 'model_test.py'. The code goes like this:
detected = torch.min( is_depth_wise_max, torch.min(is_local_max, is_not_edge) )
Why should edges not be considered as detections? Why can features extracted by VGG16 be used as the input to edge detection? What is the idea behind it?
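On the `torch.min` line quoted above: when every mask holds only 0 or 1, an element-wise minimum behaves as a logical AND, so a location is kept only if it passes all three tests. A small NumPy illustration with made-up masks (the values are arbitrary and not taken from the model):

```python
import numpy as np

is_depth_wise_max = np.array([1, 1, 0, 1], dtype=np.uint8)
is_local_max = np.array([1, 0, 1, 1], dtype=np.uint8)
is_not_edge = np.array([1, 1, 1, 0], dtype=np.uint8)

# Nested element-wise minimum, mirroring the structure of the quoted line.
detected = np.minimum(is_depth_wise_max, np.minimum(is_local_max, is_not_edge))

# The same result written as an explicit logical AND.
same_as_and = is_depth_wise_max & is_local_max & is_not_edge

print(detected.tolist())  # [1, 0, 0, 0]
```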
In Figure 4 of your paper, we can see that the trained D2-Net has low MMA when the threshold is small. This seems strange, since it performs even worse than the untrained version.
Dear Mihai Dusmanu!
Sorry to bother you again! I have two questions about the details of the code.
The first question: I tried to modify the feature extraction network, for example by using depthwise separable convolutions to accelerate the model. However, when I train the new model, the loss stays at 1.000002 after the first epoch and no longer changes. I have changed the network many times and the loss always stays at 1.000002. Have you ever encountered this problem? I can't figure out the reason.
The second question: I found two different D2_Net definitions, in model.py and model_test.py, used in train.py and extract_features.py respectively. There are some differences between them in the backbone. For example, one uses average pooling and the other uses max pooling. Are these changes in the model architecture necessary for the different uses, i.e. training versus feature extraction?
Thanks, Anna!
P.S. The log file:
[train] epoch 1 - batch 11300 / 11600 - avg_loss: 1.054927
[train] epoch 1 - batch 11400 / 11600 - avg_loss: 1.054472
[train] epoch 1 - batch 11500 / 11600 - avg_loss: 1.054008
[train] epoch 1 - avg_loss: 1.053536
[valid] epoch 1 - batch 0 / 1875 - avg_loss: 1.000002
[valid] epoch 1 - batch 100 / 1875 - avg_loss: 1.000002
[valid] epoch 1 - batch 200 / 1875 - avg_loss: 1.000002
[valid] epoch 1 - batch 300 / 1875 - avg_loss: 1.000002
[valid] epoch 1 - batch 400 / 1875 - avg_loss: 1.000002
[valid] epoch 1 - batch 500 / 1875 - avg_loss: 1.000002
[valid] epoch 1 - batch 600 / 1875 - avg_loss: 1.000002
[valid] epoch 1 - batch 700 / 1875 - avg_loss: 1.000002
[valid] epoch 1 - batch 800 / 1875 - avg_loss: 1.000002
[valid] epoch 1 - batch 900 / 1875 - avg_loss: 1.000002
[valid] epoch 1 - batch 1000 / 1875 - avg_loss: 1.000002
[valid] epoch 1 - batch 1100 / 1875 - avg_loss: 1.000002
[valid] epoch 1 - batch 1200 / 1875 - avg_loss: 1.000002
[valid] epoch 1 - batch 1400 / 1875 - avg_loss: 1.000002
[valid] epoch 1 - batch 1500 / 1875 - avg_loss: 1.000002
[valid] epoch 1 - batch 1600 / 1875 - avg_loss: 1.000002
[valid] epoch 1 - batch 1700 / 1875 - avg_loss: 1.000002
[valid] epoch 1 - batch 1800 / 1875 - avg_loss: 1.000002
[valid] epoch 1 - avg_loss: 1.000002
[train] epoch 2 - batch 0 / 11600 - avg_loss: 1.000002
[train] epoch 2 - batch 100 / 11600 - avg_loss: 1.000002
[train] epoch 2 - batch 200 / 11600 - avg_loss: 1.000002
[train] epoch 2 - batch 400 / 11600 - avg_loss: 1.000002
[train] epoch 2 - batch 500 / 11600 - avg_loss: 1.000002
Hi, when you preprocess MegaDepth for training, why don't you just use the 2D points projected from the matched 3D points as keypoints? Using the depth maps, as you did in the paper, generates more keypoints, but also includes less reliable ones, such as points on tree foliage. What's your opinion? Thanks a lot.
The feature extraction phase can be based on GPU acceleration. How can I achieve acceleration in the feature matching phase?
Thank you!
The training phase in the open source code uses the SoftDetectionModule in model.py. The test uses the HardDetectionModule of model_test.py. What is the difference?
Thanks.
Dear Mihai Dusmanu!
I am sorry to disturb you again. Could you provide the third-party models you used in your paper, such as hesaff, hesaffnet, delf, delf-new, and superpoint? Uploading them would keep the comparison objective, since these methods are updated frequently. It would also help us study your work and the others.
In addition, could you provide the code used to extract features and descriptors with these third-party models? In our SLAM (simultaneous localization and mapping) system, fewer than 2000 features are extracted, and usually fewer than 1000.
Thank you, sincerely.
While I already have another issue open, I am writing a new one as it concerns a completely different topic.
In the scene info generation step of the preprocessing pipeline, I see that you use the images obtained from the undistortion. Why don't you use the images in MegaDepth? As they match their corresponding depth maps, they should also be undistorted. The number of images in the dataset is smaller than the number of all undistorted images, as the dataset only contains the ones that have a depth map. Is there any special reason for this?
Thanks!
Hello again. I am trying to reproduce the evaluation part of d2-net.
I managed to extract the features and save them into npz files.
I simply used 'extract_features.py' to extract the features.
However, after importing the features and the matches,
the pipeline fails in the geometric reconstruction step,
which also causes the reconstruction to fail.
Can you tell me which part I am doing wrong? Thank you.
==============================================================================
Triangulating image #1112
==============================================================================
=> Image has 0 / 0 points
=> Triangulated 0 points
==============================================================================
Triangulating image #29
==============================================================================
=> Image has 0 / 0 points
=> Triangulated 0 points
==============================================================================
Triangulating image #3542
==============================================================================
=> Image has 0 / 0 points
=> Triangulated 0 points
==============================================================================
Thanks for this great work! It has all been working well so far.
At the moment, I am trying to fine-tune the weights of the d2-net models downloaded from https://github.com/mihaidusmanu/d2-net#downloading-the-models.
I wonder which image preprocessing option (caffe / torch) I should use for training and then for testing?
I am confused because it seems that the models provided at https://github.com/mihaidusmanu/d2-net#downloading-the-models were originally trained using "Caffe".
So, from my understanding, if I were to use the provided models directly, I should set the image preprocessing to "Caffe". Is that right?
However, what about fine-tuning the model?
I fine-tuned it in PyTorch, but does this mean that I should use "torch" (rather than caffe) in my training?
Then, as I also want to test this model, I would have to set the preprocessing to torch... Is this right?
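For reference, the two conventions usually meant by "caffe" and "torch" preprocessing for VGG-style networks differ in channel order and scaling. This is a minimal sketch, assuming the repository's option follows these common conventions. The function name `preprocess` is illustrative, and the mean and standard-deviation constants are the widely used ImageNet values:

```python
import numpy as np

def preprocess(image, mode):
    """image: H x W x 3 RGB array with values in [0, 255]. Returns a 3 x H x W float32 array."""
    image = image.astype(np.float32)
    if mode == "caffe":
        image = image[:, :, ::-1]  # RGB -> BGR
        image = image - np.array([103.939, 116.779, 123.68], dtype=np.float32)
    elif mode == "torch":
        image = image / 255.0      # scale to [0, 1]
        mean = np.array([0.485, 0.456, 0.406], dtype=np.float32)
        std = np.array([0.229, 0.224, 0.225], dtype=np.float32)
        image = (image - mean) / std
    else:
        raise ValueError(f"unknown preprocessing mode: {mode}")
    return np.transpose(image, (2, 0, 1))  # H x W x C -> C x H x W

white = np.full((2, 2, 3), 255, dtype=np.uint8)
caffe_input = preprocess(white, "caffe")
torch_input = preprocess(white, "torch")
print(caffe_input.shape, torch_input.shape)
```

As a general rule, the convention used when the weights were trained should also be used when they are evaluated, since the two produce inputs on very different scales.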
@mihaidusmanu : thanks for the notebook to reproduce results from your paper. It seems like the names of DELF and "hesaffnet" got switched in your plot?
methods = ['hesaff', 'hesaffnet', 'delf', 'superpoint', 'lf-net', 'd2-net', 'd2-net-ms', 'd2-net-trained', 'd2-net-trained-ms']
names = ['Hes. Aff. + Root-SIFT', 'DELF', 'HAN + HN++', 'SuperPoint', 'LF-Net', 'D2-Net', 'D2-Net MS', 'D2-Net Trained', 'D2-Net Trained MS']
Could you please fix that in your notebook when you get a chance? Thanks
Hi mihaidusmanu,
When I download Undistorted_SfM from Google Drive, only some of the files are automatically compressed for download, so I need to download the remaining *.tar.gz files one by one. How can I download everything at once?
Thank you for your good work. I have tried to download the MegaDepth v1 dataset many times, but the archive is very big (tar.gz, 199 GB). Do you have another address where the data can be downloaded part by part? Thank you.
I am trying to run through your code for training. (Because I would like to inspect the data structure of training data, especially how correspondence points are represented and stored.)
Due to limitation of my disk space, I downloaded a subset of data (selected subfolders) from your Google Drive, but I have difficulty downloading the whole MegaDepth v1 Dataset. It is too big.
Do you have a small subset of the MegaDepth data available? If so, would you mind sharing it with us for fast experimentation? Thank you!
Hi, In the supplementary material of your paper, I found the following statement:
We noticed that ReLU has a significant negative impact on the off-the-shelf descriptors, but not on the fine-tuned ones. Thus, we report results without ReLU for the off-the-shelf model and with ReLU for the fine-tuned one.
So does this mean that for the off-the-shelf model you use the features before the ReLU to calculate the detection score, while for the fine-tuned model you use the features after the ReLU to calculate the score and train the model? Is my understanding right?
And could you please offer some insight into why this ReLU layer matters? I think that without the ReLU the score might be less than 0 (if a pixel has a feature vector whose elements are all less than zero).
Thanks.
In an attempt to extract the N best keypoints from a set of results, I sorted them in descending order using: keypoints = np.asarray([row for _, row in sorted(zip(scores, keypoints), key=lambda pair: pair[0])]).squeeze()
after the results are obtained. I then cut this array, taking only the top N keypoints: keypoints = keypoints[:N,:]. The results are not what I expected. I am assuming that higher scores are better. Below is an example of the highest 300 keypoints.
I am surprised to see so many keypoints in the texture-poor section on the right. I have used model_ots.pth and model_tf.pth, and both give similar results. Interestingly, if I sort in ascending order I get more reasonable results:
In this case we see the 300 points with the lowest scores. All points are within the texture-rich areas. Is this something you would expect, or what is your intuition in this case?
Thanks,
Patrick
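One thing worth checking in the snippet above: Python's `sorted` returns ascending order unless `reverse=True` is passed, so slicing the first N rows keeps the lowest-scoring keypoints. This is a minimal sketch of a top-N selection with `np.argsort`, assuming that higher scores are better and that `keypoints` and `scores` are aligned arrays. The helper name `top_n_keypoints` and the sample data are illustrative:

```python
import numpy as np

def top_n_keypoints(keypoints, scores, n):
    """Return the n keypoints with the highest scores, best first."""
    order = np.argsort(scores)[::-1]  # indices from highest to lowest score
    keep = order[:n]
    return keypoints[keep], scores[keep]

keypoints = np.array([[10, 20, 1], [30, 40, 1], [50, 60, 1], [70, 80, 1]], dtype=float)
scores = np.array([0.2, 0.9, 0.5, 0.7])

best, best_scores = top_n_keypoints(keypoints, scores, 2)
print(best_scores.tolist())  # [0.9, 0.7]
```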
Hi mihaidusmanu,
In the paper, p(c) = ||dA - dB||_2. In the code, at line 86 of loss.py,
positive_distance = 2 - 2 * (
descriptors1.t().unsqueeze(1) @ descriptors2.t().unsqueeze(2)
).squeeze().
They are not equal.
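The two expressions are related when the descriptors have unit length: for L2-normalised vectors, ||a - b||^2 = 2 - 2 a·b, so an expression of the form 2 - 2 * dot gives the squared Euclidean distance rather than the distance itself. A quick numerical check, assuming unit-norm descriptors (the random data is only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=(5, 512))
b = rng.normal(size=(5, 512))
a /= np.linalg.norm(a, axis=1, keepdims=True)  # L2-normalise each descriptor
b /= np.linalg.norm(b, axis=1, keepdims=True)

squared_distance = np.sum((a - b) ** 2, axis=1)
from_dot_product = 2 - 2 * np.sum(a * b, axis=1)

print(np.allclose(squared_distance, from_dot_product))  # True
```

Since the square root is monotonic, the squared form ranks descriptor pairs in the same order as the plain distance, although any margin applied to it is on a different scale.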
Hi,
I am planning to use this architecture to find feature points, and subsequently inliers after matching, for pairs of images that contain little texture.
The pre-trained model returns zero inliers.
Does this mean that I have to train on such images, or do you have a better suggestion?
Thank you.
Hi, could you show me a demo of D2-Net feature matching? I want to compare it with other features and demonstrate the robustness of D2-Net under varying conditions.
Another question: if I want to compare your method with other methods such as DELF, LF-Net, SuperPoint and so on, how can I evaluate their performance in changing environments and plot a curve? What are the definitions of the X axis and Y axis?
Looking forward to your reply, thank you.
1. In the example "/d2-net/qualitative/Qualitative-Matches.ipynb", it takes 1 minute to match the two sets of extracted features. How can the matching be accelerated?
2. What is the difference between "d2_ots.pth", "d2_tf.pth", and "d2_tf_no_phototourism.pth"?
Thank you!
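Regarding matching speed, one common approach is brute-force mutual nearest-neighbour matching computed with a single matrix product. This is a minimal sketch, assuming L2-normalised descriptors so that the dot product ranks neighbours the same way as Euclidean distance. It is not the notebook's code, and `mutual_nn_match` is an illustrative name:

```python
import numpy as np

def mutual_nn_match(desc1, desc2):
    """desc1: N1 x D, desc2: N2 x D, rows L2-normalised. Returns M x 2 index pairs."""
    similarity = desc1 @ desc2.T        # N1 x N2 cosine similarities
    nn12 = similarity.argmax(axis=1)    # best match in image 2 for each row of desc1
    nn21 = similarity.argmax(axis=0)    # best match in image 1 for each row of desc2
    ids1 = np.arange(desc1.shape[0])
    mutual = nn21[nn12] == ids1         # keep pairs that choose each other
    return np.stack([ids1[mutual], nn12[mutual]], axis=1)

desc1 = np.eye(3)
desc2 = np.eye(3)[[2, 0, 1]]            # the same descriptors in a shuffled order
matches = mutual_nn_match(desc1, desc2)
print(matches.tolist())                 # [[0, 1], [1, 2], [2, 0]]
```

The same operations map directly onto GPU tensors, which is where most of the speed-up would be expected for tens of thousands of descriptors.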
Hi,
Thanks for publishing your work, it is very interesting to review.
One question I have is about the testing time network configuration. In section 4.3 in the paper you mention a number of changes for the network for testing. Specifically,
Kind Regards,
Patrick
Hi, I have already downloaded the MegaDepth dataset. Do I still need to download the SfM models? They are too big. Is there an alternative to downloading them? Thanks
I am trying to preprocess the megadepth dataset. However, by running the bash script I get a lot of RuntimeWarnings. Is this normal?
0004
0005
0007
0008
preprocess_scene.py:158: RuntimeWarning: invalid value encountered in arccos
  angles = np.arccos(np.dot(principal_axis, np.transpose(principal_axis))) / np.pi * 180
preprocess_scene.py:158: RuntimeWarning: invalid value encountered in arccos
  angles = np.arccos(np.dot(principal_axis, np.transpose(principal_axis))) / np.pi * 180
preprocess_scene.py:158: RuntimeWarning: invalid value encountered in arccos
  angles = np.arccos(np.dot(principal_axis, np.transpose(principal_axis))) / np.pi * 180
0011
preprocess_scene.py:158: RuntimeWarning: invalid value encountered in arccos
  angles = np.arccos(np.dot(principal_axis, np.transpose(principal_axis))) / np.pi * 180
0012
Thanks
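`np.arccos` emits this warning and returns NaN when its argument falls outside [-1, 1], which can happen through floating-point rounding in a dot product of unit vectors. This is a minimal sketch of clipping as a guard. Whether rounding is the actual cause in `preprocess_scene.py` is an assumption, and `pairwise_angles_deg` is an illustrative helper, not the repository's function:

```python
import numpy as np

def pairwise_angles_deg(axes):
    """axes: N x 3 array of (approximately) unit vectors. Returns N x N angles in degrees."""
    cosines = np.clip(axes @ axes.T, -1.0, 1.0)  # keep arccos inside its domain
    return np.degrees(np.arccos(cosines))

axes = np.array([[1.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0],
                 [0.6, 0.8, 0.0]])
angles = pairwise_angles_deg(axes)
print(np.isnan(angles).any())  # False
```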
Thanks for the great work! I would like to train d2-net for a problem that involves image pairs related by a rotation.
(Additionally, the images are actually karyotype images. Each image pair contains a large raw karyotype image with many overlapping chromosomes, and a small image with one chromosome instance perfectly cropped, i.e. with its background removed. Also, the two images have the same scale.)
Do you have any insight if D2-Net is a good solution to this? Thank you!
Thank you so much for your excellent work.
Multi-scale feature extraction requires 12 GB of memory and takes 4-5 seconds on an M40.
With 18,000 feature vectors, matching takes 20 s on the GPU.
How can I reduce the memory usage and speed up feature extraction and matching? Thank you!
Dear Mihai, first of all thanks for your great work.
1. What is the difference between "d2_ots.pth", "d2_tf.pth", and "d2_tf_no_phototourism.pth"?
2. Could you open-source a version of D2-Net based on the ResNet model?
Thank you!
Hi, Thanks for your sharing. In your paper, the definition of soft local max score is
But in your implementation I found
def forward(self, batch):
    b = batch.size(0)
    batch = F.relu(batch)
    max_per_sample = torch.max(batch.view(b, -1), dim=1)[0]
    exp = torch.exp(batch / max_per_sample.view(b, 1, 1, 1))
    sum_exp = (
        self.soft_local_max_size ** 2 *
        F.avg_pool2d(
            F.pad(exp, [self.pad] * 4, mode='constant', value=1.),
            self.soft_local_max_size, stride=1
        )
    )
    local_max_score = exp / sum_exp
which means that you first normalize the features by dividing by the maximum over the whole image. But it is not clear to me why this normalization is needed, and why you do not apply the same kind of normalization when calculating the channel selection score?
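To see what the normalisation does numerically, here is a NumPy transcription of the quoted `forward` for a single image. It is a sketch for experimentation, not the repository's module, and `soft_local_max` is an illustrative name. After the ReLU and the division by the global maximum, every value lies in [0, 1], so the exponentials stay in [1, e] regardless of the scale of the raw activations. The padding value of 1 equals exp(0).

```python
import numpy as np

def soft_local_max(features, size=3):
    """features: C x H x W array with at least one positive value. Returns C x H x W scores."""
    features = np.maximum(features, 0.0)   # ReLU
    peak = features.max()                  # maximum over the whole sample
    exp = np.exp(features / peak)          # values in [1, e]
    pad = size // 2
    padded = np.pad(exp, ((0, 0), (pad, pad), (pad, pad)), constant_values=1.0)
    _, h, w = exp.shape
    sum_exp = np.zeros_like(exp)
    for dy in range(size):                 # sum over the size x size neighbourhood
        for dx in range(size):
            sum_exp += padded[:, dy:dy + h, dx:dx + w]
    return exp / sum_exp

rng = np.random.default_rng(0)
raw = np.abs(rng.normal(size=(4, 6, 6))) * 50.0   # large activations on purpose
scores = soft_local_max(raw)
print(scores.shape, float(scores.max()))
```

Without the division, `np.exp` applied to activations of this size would grow very quickly. Whether numerical range is the author's actual motivation is an assumption. Note also that the division is undefined when the whole map is zero after the ReLU.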
Hi,
thanks for making your code public.
I was wondering if there are performance stats for the trained model weights you offer? That would be great for deciding which weights to use for experiments.
Hello, great paper.
I just want to ask about the feature extraction speed.
I tried to extract D2-Net features on the Aachen Day-Night dataset.
However, it takes around 20-30 seconds per image, even with the multi-scale option off.
Is it normal to observe this?
After tracing the code, it seems that the problem could be caused by the feature interpolation.
Do you have any comments on this? Thanks.
Hi, thanks for sharing. I have a question about your paper. In Section 4.3, Implementation details, you mention that
For each pair, we selected a random 256 × 256 crop centered around one correspondence. We use a batch size of 1 and make sure that the training pairs present at least 128 correspondences in order to obtain meaningful gradients.
In my understanding, your model needs pairs of images and pixel-wise correspondences as input. You then densely extract the feature vector for each pixel and calculate the loss based on the correspondence information. Why, then, do you need to crop the images instead of using the original images? Is it because of memory constraints? And what do you mean by a random 256 × 256 crop centered around one correspondence? Around which correspondence?
After that, do you choose a fixed number of pixel-wise correspondences as your positive pairs, or do you use all of them? And how can you guarantee that the two cropped images have at least 128 corresponding pixel pairs?
Thanks a lot !
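One possible reading of the quoted sentence is: pick one correspondence at random, centre a 256 × 256 window on it in each image (clamped to the image bounds), and keep the correspondences that fall inside both windows. The sketch below implements that reading only. It is not the repository's data loader, and `crop_around`, `inside` and `crop_pair` are hypothetical helpers:

```python
import numpy as np

def crop_around(point, image_shape, size=256):
    """Top-left corner of a size x size crop centred on point, clamped to the image."""
    h, w = image_shape
    top = int(np.clip(point[0] - size // 2, 0, max(h - size, 0)))
    left = int(np.clip(point[1] - size // 2, 0, max(w - size, 0)))
    return top, left

def inside(pos, top, left, size):
    """Boolean mask of the (row, col) positions that lie inside the crop."""
    return ((pos[:, 0] >= top) & (pos[:, 0] < top + size) &
            (pos[:, 1] >= left) & (pos[:, 1] < left + size))

def crop_pair(pos1, pos2, shape1, shape2, rng, size=256):
    """pos1, pos2: N x 2 arrays of matching (row, col) positions in the two images."""
    idx = rng.integers(len(pos1))              # the correspondence to centre on
    top1, left1 = crop_around(pos1[idx], shape1, size)
    top2, left2 = crop_around(pos2[idx], shape2, size)
    keep = inside(pos1, top1, left1, size) & inside(pos2, top2, left2, size)
    return ((top1, left1), (top2, left2),
            pos1[keep] - np.array([top1, left1]),
            pos2[keep] - np.array([top2, left2]))

rng = np.random.default_rng(0)
pos1 = rng.uniform(0, 500, size=(1000, 2))
pos2 = pos1 + 20.0                             # toy correspondences: a pure shift
corner1, corner2, crop_pos1, crop_pos2 = crop_pair(pos1, pos2, (600, 800), (600, 800), rng)
print(corner1, corner2, len(crop_pos1))
```

Under this reading, a caller could count the surviving correspondences and skip pairs with fewer than 128 of them.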
Thanks for this very interesting paper, and congrats on the CVPR acceptance!
I have a follow-up question: I believe that in your experiments the compared feature descriptors have different dimensionalities. For example, DELF's default is 40D, while the D2-Net descriptor is 512D (and 128D for most of the other approaches), so this can have a large impact on the memory used by the system. Did you perform experiments using the same dimensionality for different descriptors? Since the performance of the different techniques is quite close in the different experiments, I am wondering if this could be playing an important role here.
One quick way to try this out for DELF features is to tune the dimensionality, which can be done by simply changing the pca_dim parameter in the DELF config: https://github.com/tensorflow/models/blob/master/research/delf/delf/python/examples/delf_config_example.pbtxt#L22
Hi, thanks for sharing. Could you please share a TensorFlow version of D2-Net?
Dear Mihai, first of all thanks for your great work. I am currently trying to make D2 faster, and I read your suggestion to use MobileNet in a different GitHub issue. My application needs a certain precision in feature position. The current truncated VGG16 network scales down the input image resolution by a factor of ~4 (e.g. input 640x640 => dense_features = 159x159), which is the most downscaling I can afford. Translating the spatial downscaling to MobileNetV2, I would have to truncate its structure pretty early, after layer 3 (the 2nd bottleneck layer). Do you have any insight into whether that would make sense? Or could it be that MobileNet(V2) is not a good candidate for a feature-precise D2 version, due to its early drastic spatial reduction?
Hi,
Thanks for the excellent work!
I'm having trouble finding how you trained the following:
models/d2_ots.pth
models/d2_tf.pth
Pardon me if I have missed the info in the paper or the github details
regards
QF
Hi,
I was wondering if you had any insight into why I run into NaNs while training on a subset (10-50 scenes) of the MegaDepth data. Training as suggested in the readme on the ~113 MegaDepth scenes works as expected. However, when the only change is the training set size, I get 0s appearing in depth_wise_max in model.py, causing the score to become NaN through a division by zero. Is this a symptom of overfitting or something like that?
Also, are the weights randomly initialized? In this repo, even in the extractor module, when you pull down the VGG head, there seems to be no pretrained=True flag.
When I try to extract multiscale features, the following error occurs:
Traceback (most recent call last):
File "extract_features.py", line 120, in
model
File "/data1/stx/D2-net/d2-net-master/lib/pyramid.py", line 46, in process_multiscale
detections = torch.min(detections, (1 - banned))
RuntimeError: Expected object of scalar type Bool but got scalar type Byte for argument #2 'other' in call to _th_min
My python version is 3.6.6, pytorch version is 1.3.0
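The traceback points at mixing a Byte tensor with a Bool tensor in `torch.min`. One way to sidestep the dtype mismatch is to convert both masks to `bool` and use logical operators instead of `torch.min` and `1 - banned`. This is a minimal sketch of that idea, not the repository's fix. `suppress_banned` is a hypothetical helper, and it assumes a PyTorch version with bool tensors (1.2 or later):

```python
import torch

def suppress_banned(detections, banned):
    """Keep detections that are not banned, regardless of the input mask dtypes."""
    detections = detections.bool()
    banned = banned.bool()
    return detections & ~banned   # logical AND with the negated ban mask

detections = torch.tensor([1, 1, 0], dtype=torch.uint8)  # Byte mask
banned = torch.tensor([True, False, False])              # Bool mask
print(suppress_banned(detections, banned).tolist())      # [False, True, False]
```

If downstream code expects a Byte mask, the result can be converted back with `.byte()`.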
In megadepth_utils/preprocess_scene.py line 209, the inner loop's range is defined as for idx2 in range(idx1 + 1, n_images):.
This results in the lower triangle of the overlap matrix being all -1, hence drastically reducing the number of positive pairs. Could you confirm this is the case? Thanks
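Iterating only over `idx2 > idx1` does not by itself leave the lower triangle empty. That depends on whether the loop body also writes the `[idx2, idx1]` entry. A toy sketch of the pattern, with a hypothetical `overlap_fn` standing in for the real overlap computation (this is not the repository's code):

```python
import numpy as np

def fill_overlap(n_images, overlap_fn):
    """Visit each unordered pair once, but fill both directed entries."""
    overlap = np.full((n_images, n_images), -1.0)
    for idx1 in range(n_images):
        for idx2 in range(idx1 + 1, n_images):
            overlap[idx1, idx2] = overlap_fn(idx1, idx2)
            overlap[idx2, idx1] = overlap_fn(idx2, idx1)  # lower-triangle entry
    return overlap

# Toy, non-symmetric stand-in so the two triangles are distinguishable.
matrix = fill_overlap(3, lambda i, j: i * 10 + j)
print(matrix)
```

Checking whether line 209's loop body contains a second assignment of this kind would settle the question.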
How is D2-Net used for 3D reconstruction with COLMAP? How is the .d2-net data format linked to COLMAP?
Thank you
Hello. I just want to ask about the HPatches eval. In the file there are
n_i=52
n_v=56
But I counted the sequences in the dataset and it seems to be
n_i=57
n_v=59
Is there something I missed? Thanks
@mihaidusmanu Very nice work. Thanks for sharing your implementation!
Would it be possible for you to include an implementation for HPatches evaluation in this repo?
Thanks,
-Vimal
Dear Mihai Dusmanu!
Thank you for your answer last time! I ran the code you provided and, after training for the default 10 epochs, got the model saved in 'checkpoints/d2.best.pth', which I used in place of the downloaded model 'models/d2_tf.pth' in extract_features.py. However, when I test the training result in HPatches-Sequences-Matching-Benchmark.ipynb, the features and matches I compute with d2.best.pth are fewer than those in the precomputed d2-net-trained errors. When I run Qualitative-Matches.ipynb to see the result, the plot is also not as good as with the old model, d2_tf.pth. I am wondering if there is any step I missed? How can I train a model that matches d2_tf.pth?
Thanks Sincerely!