celebrandil / cudasift

A CUDA implementation of SIFT for NVidia GPUs (1.2 ms on a GTX 1060)

License: MIT License

Cuda 88.79% C++ 7.80% C 2.41% CMake 1.00%

cudasift's Introduction

CudaSift - SIFT features with CUDA

This is the fourth version of a SIFT (Scale Invariant Feature Transform) implementation using CUDA for GPUs from NVidia. The first version is from 2007 and GPUs have evolved since then. This version is slightly more precise and considerably faster than the previous versions and has been optimized for Kepler and later generations of GPUs.

On a GTX 1060 GPU the code takes about 1.2 ms on a 1280x960 pixel image and 1.7 ms on a 1920x1080 pixel image. There is also code for brute-force matching of features that takes about 2.2 ms for two sets of around 1900 SIFT features each.

The code relies on CMake for compilation and OpenCV for image containers. OpenCV can however be quite easily changed to something else. The code can be relatively hard to read, given the way things have been parallelized for maximum speed.

The code is free to use for non-commercial applications. If you use the code for research, please cite the following paper.

M. Björkman, N. Bergström and D. Kragic, "Detecting, segmenting and tracking unknown objects using multi-label MRF inference", CVIU, 118, pp. 111-127, January 2014. ScienceDirect

Update in feature matching (2019-05-17)

The brute force feature matcher has been significantly improved in speed. The largest improvements can be seen for large feature sets with 10000 features or more, but as can be seen below, it performs rather well even with just 2000 features. The file match.pdf includes a description of the optimizations done in this version.

New version for Pascal (2018-10-26)

There is a new version optimized for Pascal cards, but it should also work on many older cards. Since it includes some bug fixes that slightly change how features are extracted, which might affect matching to features extracted using an older version, the changes are kept in a new branch (Pascal). The fixes include a small change in ScaleDown that corrects an odd behaviour for images with heights not divisible by 2^(#octaves). The second change is a correction of an improper shift of (0.5,0.5) pixels, when pixel values were read from the image to create a descriptor.
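As a quick way to see which image sizes are concerned by the ScaleDown fix, the check below tests whether a dimension can be halved once per octave without a remainder. The helper name divisibleByOctaves is made up for this illustration and is not part of CudaSift.

```cpp
// Hypothetical helper, not part of CudaSift.
// Returns true if `size` is divisible by 2^numOctaves, i.e. it can be
// halved once per octave without leaving a remainder.
bool divisibleByOctaves(int size, int numOctaves) {
    return size % (1 << numOctaves) == 0;
}
```

With 5 octaves, a height of 960 = 30 * 32 divides evenly, while 1080 = 33 * 32 + 24 does not.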

Then there are some improvements in terms of speed, especially in the Laplace function, that detects DoG features, and the LowPass function, that is seen as preprocessing and is not included in the benchmarking below. Maybe surprisingly, even if optimizations were done with respect to Pascal cards, these improvements were even better for older cards. The changes involve trying to make each CUDA thread have more work to do, using fewer thread blocks. For typical images of today, there will be enough blocks to feed the streaming multiprocessors anyway.

Architecture	GPU	1280x960 (ms)	1920x1080 (ms)	GFLOPS	Bandwidth (GB/s)	Matching (ms)
Turing	GeForce RTX 2080 Ti	0.42*	0.56*	11750	616	0.30*
Pascal	GeForce GTX 1080 Ti	0.58*	0.80*	10609	484	0.42*
Pascal	GeForce GTX 1060	1.2	1.7	3855	192	2.2
Maxwell	GeForce GTX 970	1.3	1.8	3494	224	2.5
Kepler	Tesla K40c	2.4	3.4	4291	288	4.7

Benchmarking of new version (2018-08-22)

About every second year, I try to update the code to gain even more speed through further optimization. Here are some results for a new version of the code. Improvements in speed have primarily been gained by reducing communication between host and device, better balancing the load on caches, shared and global memory, and increasing the workload of each thread block.

Architecture	GPU	1280x960 (ms)	1920x1080 (ms)	GFLOPS	Bandwidth (GB/s)	Matching (ms)
Pascal	GeForce GTX 1080 Ti	0.7	1.0	10609	484	1.0
Pascal	GeForce GTX 1060	1.6	2.4	3855	192	2.2
Maxwell	GeForce GTX 970	1.9	2.8	3494	224	2.5
Kepler	Tesla K40c	3.1	4.7	4291	288	4.7
Kepler	GeForce GTX TITAN	2.9	4.3	4500	288	4.5

Matching is done between two sets of 1818 and 1978 features respectively.

It's questionable whether further optimization really makes sense, given that just transferring a 1920x1080 pixel image to the device takes about 1.4 ms on a GTX 1080 Ti. Even if the brute force feature matcher is not much faster than earlier versions, it does not have the same O(N^2) temporary memory overhead, which is preferable if there are many features.
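To get a feeling for the O(N^2) overhead, the sketch below estimates the size of a dense table holding one float per feature pair. That storage layout is an assumption made for illustration, and denseScoreBytes is a hypothetical helper, not a CudaSift function.

```cpp
#include <cstddef>

// Hypothetical helper, not part of CudaSift.
// Bytes needed for a dense n1 x n2 table with one float score per pair.
std::size_t denseScoreBytes(std::size_t n1, std::size_t n2) {
    return n1 * n2 * sizeof(float);
}
```

Assuming 4-byte floats, the 1818 and 1978 features used above give about 14.4 MB, while two sets of 20000 features would need 1.6 GB.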

Benchmarking of previous version (2017-05-24)

Computational cost (in milliseconds) on different GPUs:

Architecture	GPU	1280x960 (ms)	1920x1080 (ms)	GFLOPS	Bandwidth (GB/s)	Matching (ms)
Pascal	GeForce GTX 1080 Ti	1.7	2.3	10609	484	1.4
Pascal	GeForce GTX 1060	2.7	4.0	3855	192	2.6
Maxwell	GeForce GTX 970	3.8	5.6	3494	224	2.8
Kepler	Tesla K40c	5.4	8.0	4291	288	5.5
Kepler	GeForce GTX TITAN	4.4	6.6	4500	288	4.6

Matching is done between two sets of 1616 and 1769 features respectively.

The improvements in this version involved a slight adaptation for Pascal, changing from textures to global memory (mostly through L2) in the most costly function LaplaceMulti. The medium-end card GTX 1060 is impressive indeed.

Usage

There are two different containers for storing data on the host and on the device: SiftData for SIFT features and CudaImage for images. Since memory allocation on GPUs is slow, it's usually preferable to preallocate a sufficient amount of memory using InitSiftData(), in particular if SIFT features are extracted from a continuous stream of video camera images. On repeated calls, ExtractSift() will reuse previously allocated memory.

#include <opencv2/core/core.hpp>
#include <opencv2/highgui/highgui.hpp>
#include <cudaImage.h>
#include <cudaSift.h>

/* Reserve memory space for a whole bunch of SIFT features. */
SiftData siftData;
InitSiftData(siftData, 25000, true, true);

/* Read image using OpenCV and convert to floating point. */
cv::Mat limg;
cv::imread("image.png", 0).convertTo(limg, CV_32FC1);
/* Allocate 1280x960 pixel image with device side pitch of 1280 floats. */ 
/* Memory on host side already allocated by OpenCV is reused.           */
CudaImage img;
img.Allocate(1280, 960, 1280, false, NULL, (float*) limg.data);
/* Download image from host to device */
img.Download();

int numOctaves = 5;    /* Number of octaves in Gaussian pyramid */
float initBlur = 1.0f; /* Amount of initial Gaussian blurring in standard deviations */
float thresh = 3.5f;   /* Threshold on difference of Gaussians for feature pruning */
float minScale = 0.0f; /* Minimum acceptable scale to remove fine-scale features */
bool upScale = false;  /* Whether to upscale image before extraction */
/* Extract SIFT features */
ExtractSift(siftData, img, numOctaves, initBlur, thresh, minScale, upScale);
...
/* Free space allocated from SIFT features */
FreeSiftData(siftData);

Parameter setting

The requirements on the number and quality of features vary from application to application. Some applications benefit from a smaller number of high quality features, while others require as many features as possible. More distinct features with higher DoG (difference of Gaussians) responses tend to be of higher quality and are easier to match between multiple views. With the parameter thresh a threshold can be set on the minimum DoG to prune features of lower quality.

In many cases the most fine-scale features are of little use, especially when noise conditions are severe or when features are matched between very different views. In such cases the most fine-scale features can be pruned by setting minScale to the minimum acceptable feature scale, where 1.0 corresponds to the original image scale without upscaling. As a consequence of pruning the computational cost can also be reduced.

To increase the number of SIFT features, at a higher computational cost, the original image can be automatically upscaled to double the size using the upScale parameter, in accordance with Lowe's recommendations. One should keep in mind though that by doing so the fraction of features that can be matched tends to go down, even if the total number of extracted features increases significantly. If it's enough to instead reduce the thresh parameter to get more features, that is often a better alternative.

Results without upscaling (upScale=false) of 1280x960 pixel input image.

thresh	#Matches	%Matches	Cost (ms)
1.0	4236	40.4%	5.8
1.5	3491	42.5%	5.2
2.0	2720	43.2%	4.7
2.5	2121	44.4%	4.2
3.0	1627	45.8%	3.9
3.5	1189	46.2%	3.6
4.0	881	48.5%	3.3

Results with upscaling (upScale=true) of 1280x960 pixel input image.

thresh	#Matches	%Matches	Cost (ms)
2.0	4502	34.9%	13.2
2.5	3389	35.9%	11.2
3.0	2529	37.1%	10.6
3.5	1841	38.3%	9.9
4.0	1331	39.8%	9.5
4.5	954	42.2%	9.3
5.0	611	39.3%	9.1


cudasift's Issues

Hello, my graphics card has the Pascal architecture. Can I use your program with it?

Windows 10, GTX 1060

Non-Deterministic result of CudaSift

Thank you for this implementation!

When I tested CudaSift, I found that the result is non-deterministic, that is, for the same image the program outputs different features. Is this abnormal?

For example, I compiled and ran mainSift.cpp with the maximum number of SIFT features set to 3, and the results are:
Run1
xpos, ypos, score, descriptor[0], descriptor[1], descriptor[2]
1663.008179 215.823730 149.000000 0.001161 0.047922 0.060676
1235.375366 263.208405 111.000000 0.000000 0.059338 0.047047
1590.166992 1039.981323 201.000000 0.007714 0.001538 0.000000

Run2
xpos, ypos, score, descriptor[0], descriptor[1], descriptor[2]
1663.008179 215.823730 149.000000 0.001161 0.047922 0.060676
1660.701904 254.537323 111.000000 0.001048 0.023971 0.046532
1590.166992 1039.981323 201.000000 0.007714 0.001538 0.000000

Cannot find -lopencv_dep_cudart error

The following error shows up while doing sudo make install after cmake inside the build folder

[ 14%] Linking CXX executable cudasift
/usr/bin/ld: cannot find -lopencv_dep_cudart
collect2: error: ld returned 1 exit status
CMakeFiles/cudasift.dir/build.make:757: recipe for target 'cudasift' failed
make[2]: *** [cudasift] Error 1
CMakeFiles/Makefile2:67: recipe for target 'CMakeFiles/cudasift.dir/all' failed
make[1]: *** [CMakeFiles/cudasift.dir/all] Error 2
Makefile:149: recipe for target 'all' failed
make: *** [all] Error 2

saving descriptors

Hello.
In a project I want to save keypoints and descriptors in a database.
Given d_data and h_data in SiftData, how can I separate keypoints and descriptors and save them?
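One way this could be approached, as a rough sketch only: walk over the host-side array of points and copy positions and descriptors into two separate flat buffers that can then be written to a database. PointLike below is a stand-in that holds just xpos, ypos and a 128-float data array; the real SiftPoint has more fields, and the names PointLike, SplitFeatures and splitFeatures are invented for this illustration.

```cpp
#include <cstddef>
#include <vector>

// Stand-in for CudaSift's SiftPoint with only the fields used here
// (assumption: the real struct also has xpos, ypos and data[128]).
struct PointLike {
    float xpos, ypos;
    float data[128];
};

struct SplitFeatures {
    std::vector<float> keypoints;    // x0, y0, x1, y1, ...
    std::vector<float> descriptors;  // 128 floats per feature, back to back
};

// Copies positions and descriptors of numPts points into separate buffers.
SplitFeatures splitFeatures(const PointLike* pts, std::size_t numPts) {
    SplitFeatures out;
    out.keypoints.reserve(2 * numPts);
    out.descriptors.reserve(128 * numPts);
    for (std::size_t i = 0; i < numPts; ++i) {
        out.keypoints.push_back(pts[i].xpos);
        out.keypoints.push_back(pts[i].ypos);
        out.descriptors.insert(out.descriptors.end(),
                               pts[i].data, pts[i].data + 128);
    }
    return out;
}
```

The two vectors can be stored as binary blobs; scale, orientation and other per-point fields could be added to the keypoint buffer in the same way if needed.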

Why I can't get any match correspondence even with the same images?

Thank you for sharing this implementation!

Environment

Platform:
Windows 10 64-bit, VS2015
OpenCV3.4.1
CUDA8
1060

Current Behavior:

I have successfully built this repository with no errors.
I have tested mainSift.cpp with the same image (limg and rimg are the same picture) many times, and none of the runs found any matching correspondences!
I never changed the code of mainSift.cpp.

Here is My output of mainSift.cpp:

Image size = (812,725)
Initializing data...
Device Number: 0
  Device name: GeForce GTX 1060
  Memory Clock Rate (MHz): 4004
  Memory Bus Width (bits): 192
  Peak Memory Bandwidth (GB/s): 192.2

SIFT extraction time =        1.99 ms
Incl prefiltering & memcpy =  3.09 ms 2623

SIFT extraction time =        1.82 ms
Incl prefiltering & memcpy =  3.06 ms 2624

MatchSiftData time =          6.16 ms
...
MatchSiftData time =          4.85 ms
Number of original features: 2623 2624
Number of matching features: 0 0 0% 1 3

Expected Behavior:

I expect CudaSift to find the correct correspondences and show them between the model image and the test image.

Besides, finding the homography depends on the features of both the model image and the test image. Why is only siftData1 passed to the functions FindHomography and ImproveHomography in mainSift.cpp?

for (int i = 0; i<500; i++)
  MatchSiftData(siftData1, siftData2);
float homography[9];
int numMatches;
FindHomography(siftData1, homography, &numMatches, 10000, 0.00f, 0.80f, 5.0);
int numFit = ImproveHomography(siftData1, homography, 5, 0.00f, 0.80f, 3.0);

I have been struggling with the above two problems for several days. Do you have any suggestions?
Thank you!

Orientation and descriptors are always extracted in the same Gaussian image inside one octave

Hi, I tried to read your code and found that you do not store the Gaussian smoothed images. In the original SIFT paper, orientation and descriptors are extracted from the Gaussian smoothed image with the scale closest to that of the keypoint. In your implementation, I see they are extracted from the first image of the octave (i.e. the only saved one). I understand this gives speed benefits, but did you thoroughly test scale/rotation invariance with this simplification?

Regards,
Yao

descriptor of keypoint seems not correct

Hi,
I used two images to test feature matching, and the matching result seems wrong: the number of matches is far from what the OpenCV version gives.
The two images are below. I get 0 matches from CudaSift/mainSift.cpp, but more than 100 matches from the OpenCV version.

CudaSift 3.0?

Hi @Celebrandil,

Thanks for the sweet library, with an agreeable license! It's quick to compile and easy to use, and I've started in on some changes that you may or may not want. If I'm heading in a direction you would rather not, simply ignore. My fork is here.

It's not PR ready at this point, but the main points are in there / ready to be changed if you have specific preferences.

CMake Changes

- Separate out into library and executable.
  - Goal is to support parent CMake projects
  - Library does not need OpenCV, only application.
  - Default action: no parent CMake -> build demo, if parent CMake do not.
- Make VERBOSE an option
- More explicit options for CUDA architectures, as well as overrides available if users so choose.
  - Maybe the branch Maxwell would not be appropriate anymore.
- TODO: revisit the install stuff given layout changes.
  - Note to self: rpath of executable built needs to be explicit
- There were some sections that gave different flags for Windows vs Apple vs Linux that I did not really understand. No SSE2 for Windows, and linux got a "-lineinfo" added (as opposed to all platforms). Did the reinterpretation given that Verbose is an option now make sense / should it remain this way?

Layout Changes

Since I desire to build this as a submodule, the split was to take *.h and put them in include/cudaSift, the two .cpp demo files in a demo/ folder, and the implementations went in src/. I have a very strong preference for the include/ and src split, because I like to have #include <cudaSift/sift.h> over #include <cudaSift.h>.

To be clear, this is very much a personal preference. I did it not as a suggestion, but because I was already going to be changing a lot.

Note: things like cudaSiftH.h -> cudaSift/sift_host.h are completely open to change. I couldn't decide whether your preference would be camel case or underscore, etc. E.g. cudaSift/SiftH.h or cudaSift/siftH.h etc. Happy to change the names back to whatever!

Code Changes

CUDA 9 deprecated __shfl_xor. This fix seems to work as expected, but it may not be completely correct. The matches seem to be consistent with and without that code in there, but image diffs (my comparison method) in this regard are generally useless.
Everything is in namespace cudaSift. Promotion to a library build makes introducing the namespace appropriate in my opinion.
cudaSift::CudaImage felt a little awkward to me, so I changed it to just be cudaSift::Image. Everything else stayed the same.

Future Work

Edit:

These global symbols:

///////////////////////////////////////////////////////////////////////////////
// Kernel configuration
///////////////////////////////////////////////////////////////////////////////

__constant__ float d_Threshold[2];
__constant__ float d_Scales[8], d_Factor;
__constant__ float d_EdgeLimit;
__constant__ int d_MaxNumPoints;

__device__ unsigned int d_PointCounter[1];
__constant__ float d_Kernel1[5];
__constant__ float d_Kernel2[12*16];

were changed to be encapsulated in a struct SiftKernelParams that is managed by SiftData. There is a host and device copy, items are copied back and forth as needed. Basically, cudaMemcpy{To,From}Symbol calls were changed to just cudaMemcpyAsync. After numerous runs, there doesn't appear to be any noticeable performance change. Each of these are small enough that they probably don't have too many cache misses?

This had to be done since multiple streams cannot necessarily share any of these parameters.

Copyright.txt, Version, and sm_52 architecture ...

I'm playing with CudaSift. Cool job...
But, in order to have it installed easily, you may have to:

Add a "Copyright.txt" file under the root folder
Better to have a version file, now 4.0.0?
sm_52 is for my computer now. So maybe you can let users choose their own CUDA architecture?

Cheers
Pei

Systematic Point Density

First, many thanks for keeping this project live.

Below, I have an example image that I am attempting to extract keypoints from. I am getting a very strange systematic point distribution where the top of the image has a significantly higher number of points. In the example image, keypoints are still identified across the rest of the image. In some cases, no points are found outside of the top of the image.

I am not sure if the total number of points (SiftData.maxPts) is being exceeded here and therefore breaking early?

Any insight appreciated.

Non-deterministically getting SiftPoint x, y positions outside image boundary?

Hello!

Thanks for the great library. We were using version three and are now trying version four. However, we notice a problem:

Occasionally, when we run the following code, we get SIFT points that lie outside our image boundary (e.g., (341, 491) in an image that is (640, 480)). We aren't sure why this is happening because we would expect that given identical images, we should get identical SIFT results (barring floating point errors, but 10+ pixels seems large).

For now, we simply throw out such SIFT points, but are curious to know if you've seen this issue before or if you have a hunch as to why it might be happening. We don't believe we're doing anything special, but please let us know if you think we might be doing something wrong!

Many thanks again for the work and any help!
Daniel

// gray is a 640x480 grayscale image read from 16bit PNG
gray.convertTo(gray, CV_32FC1);
unsigned int w = gray.cols;
unsigned int h = gray.rows;

InitCuda();
CudaImage cudaImage;
cudaImage.Allocate(w, h, iAlignUp(w, 128), false, NULL, (float*) gray.data);
cudaImage.Download();

ExtractSift(siftData, cudaImage, 5, 0.0f, 3.0f, 0.0f, 1.0f);
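The workaround mentioned above, throwing out points that fall outside the image, could look roughly like this. Pt is a stand-in type with only xpos and ypos (the real SiftPoint has more fields), and insideImage and dropOutsidePoints are hypothetical helper names.

```cpp
#include <algorithm>
#include <vector>

// Stand-in for a SIFT point; only the position is needed here.
struct Pt {
    float xpos, ypos;
};

// True if the position lies within [0, width) x [0, height).
bool insideImage(const Pt& p, int width, int height) {
    return p.xpos >= 0.0f && p.ypos >= 0.0f &&
           p.xpos < static_cast<float>(width) &&
           p.ypos < static_cast<float>(height);
}

// Removes all points whose position is outside the image.
void dropOutsidePoints(std::vector<Pt>& pts, int width, int height) {
    pts.erase(std::remove_if(pts.begin(), pts.end(),
                             [=](const Pt& p) {
                                 return !insideImage(p, width, height);
                             }),
              pts.end());
}
```

For the example in this report, (341, 491) is rejected for a 640x480 image because 491 is not below the height of 480. This only hides the symptom and says nothing about the cause.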

problem in extractSift method

hello and thanks for your great code.
When I built the code in Visual Studio 2015, many errors appeared saying that ScaleDown and other functions of cudaSiftD.cu were already defined. I made declarations for them in cudaSiftD.h and then commented out the #include "cudaSiftD.cu" in cudaSiftH.cu.
But after running the project with left.pgm and right.pgm as inputs, right after the ExtractSift call finishes, 0 points are found as features in both images. What should I do?

Are you using CUDA8.0 or CUDA9.0?

matching.cu does not compute the L2 norm?

Is matching.cu computing the L2 norm? The line:

sum += siftPoint[16*j+tx] * ptr2[16*j+tx];

seems to be just an element-by-element multiplication of the feature vectors. I am not sure how this computes a distance between the two SIFT descriptors.

Additionally, the sums are being done many times for the same element of the feature vector. In this setting, tx would range from 0 to 15...

LaplaceMulti() execution failed

Hello,

I'm trying to run the mainSift.cpp program on a Tesla M2090. First of all I had to change all occurrences of sm_35 to sm_20 in CMakeLists.txt. After that, this message is returned during the execution of LaplaceMulti():

checkMsg() CUDA error: LaplaceMulti() execution failed
 in file </ghome/rzhengac/Downloads/CudaSift-Maxwell/cudaSiftH.cu>, line 324 : unknown error.

Why does this happen?

So what is the "right" value for initBlur ?

Hello,
I am used to the matlab vlfeat to manage sift features. The command line is :

edge_thresh = 2.5;
[xy,descritor id] = vl_sift(single(rgb2gray(image_i)), 'edgethresh', edge_thresh) ;

With Cuda Sift it is :

upscale = false.
ExtractSift(siftData1, img1, 7, initBlur, thresh, 0.0f, upscale);

A priori, I know nothing about the image quality. So what is the "right" value for initBlur?

Thank you

drawMatches method

Hi, I just want to ask if there is any method to draw matches between a pair of images just like OpenCV's drawMatches ?

Problem in matching.cu

Error while executing ./cudasift

Message got after executing ./cudasift

Image size = (0,0)
Initializing data...
Device Number: 0
Device name: GeForce GTX 1060 6GB
Memory Clock Rate (MHz): 4004
Memory Bus Width (bits): 192
Peak Memory Bandwidth (GB/s): 192.2

Failed to allocate device data
Failed to allocate device data
Download time = 0.01 ms
Download time = 0.00 ms
Failed to allocate device data
checkMsg() CUDA error: LowPass() execution failed
in file </opt/CudaSift/cudaSiftH.cu>, line 351 : invalid configuration argument.

Out of Memory error

Hi @Celebrandil ,
I get an out of memory error when I try to run MatchSiftData() with nearly 20000 features from each image. Is it related to a memory allocation error?
Thanks

checkMsg() CUDA error: LowPass() execution failed

When I try and run cudasift I get the following output:

Image size = (0,0)
Initializing data...
Device Number: 0
Device name: GeForce GTX 860M
Memory Clock Rate (MHz): 2505
Memory Bus Width (bits): 128
Peak Memory Bandwidth (GB/s): 80.2

Failed to allocate device data
Failed to allocate device data
Failed to allocate device data
checkMsg() CUDA error: LowPass() execution failed in file </home/francois/Desktop/P3_CV3/CudaSift/cudaSiftH.cu>, line 416 : invalid configuration argument.

Does anyone know a solution to the problem?

ERROR : safeCall() Runtime API error in file </home/kk/Documents/CudaSift-Maxwell/cudaSiftH.cu>, line 42 : invalid device symbol.

Following is the output when the object file is executed.
[kk@localhost CudaSift-Maxwell]$ ./cudasift
Image size = (1280,960)
Initializing data...
Device Number: 0
Device name: GeForce 210
Memory Clock Rate (MHz): 600
Memory Bus Width (bits): 64
Peak Memory Bandwidth (GB/s): 9.6

Download time = 0.92 ms
Download time = 0.89 ms
safeCall() Runtime API error in file </home/pict/Documents/CudaSift-Maxwell/cudaSiftH.cu>, line 42 : invalid device symbol.
[kk@localhost CudaSift-Maxwell]$ C
What is the problem?

error MSB6006: "cmd.exe" exited with code 1.

I built this project with CMake 3.9.1 and chose Visual Studio 2013. After generating the solution file, I opened it and rebuilt, but it always shows error MSB6006: "cmd.exe" exited with code 1. I have double-checked that there is no space in the file path and I also ran as administrator. It still has the same problem.

why descriptor doesn't work

Hello, first of all thank you very much for your contribution of the CUDA version of SIFT. I am working on an image search project, but when I use CudaSift to extract the SIFT features of both the query image and the database images, the SIFT features of the query image don't work. When I use the CPU with the vlfeat library to extract SIFT features, I can achieve an accuracy of 62.6%, but if I use CudaSift, I can only achieve 6.2%. What is even stranger is that if I just generate the 128-dimensional descriptors randomly, I can also achieve 6%. Can I get some suggestions about this? Thank you very much.

Potential Memory Leak in cudaSiftH.cu

Hello,

Thanks for keeping this project active. There is, however, a potential memory leak in cudaSiftH.cu. In particular, the host memory is allocated twice using malloc() then cudaMallocHost() at lines 226 and 227, respectively. Either one of them is enough and the other one may need to be removed.

Rotation invariance

Hi,
After replacing OpenCV's implementation of SIFT with CudaSift in my code I noticed that it could no longer handle much rotation between the images being matched. Is this a known limitation to CudaSift, or am I missing something?

When testing on mainSift.cpp and the sample data I get the following on the original images img1.png and img2.png:
Number of original features: 1818 1978
Number of matching features: 792 801 43.5644% 1 3

When rotating one of the images 180 degrees I get the following:
Number of original features: 1818 2040
Number of matching features: 261 400 14.3564% 1 3

Suggestions for future versions?

After the latest commits I'm running out of ideas of what to improve and would like to hear if anyone has any suggestions for future versions. For further speed improvements, I can see the possibility of adding functionalities for uploading images that are not necessarily in floats, using half precision floats for storage and matching of SIFT vectors, as well as projecting vectors to a lower dimension, similar to PCA-SIFT. In most practical scenarios though, gaining a fraction of a millisecond doesn't help much, since there is much more around it that is more important. Thus the nature of the end application becomes more important than the actual feature extraction code.

cudaMallocPitch error!!!

cv::imread("image.png",0).convertTo(img1,CV_32FC1);
this works well.... no issue

however, in my application the image is ROI of another image so I can't simply read it from file. So,
img2 = image2(ROI)
then, cv::cvtColor(img2,img2,CV_BGR2GRAY);
then, img2.convertTo(img2,CV_32FC1);

This causes segmentation fault. I also checked if they belong to the same data type.
img1.type() = img2.type() = 5. I also printed the cv::Mat, and are similar in size and content

The error seems to be caused by

safeCall(cudaMallocPitch((void **)&d_data, (size_t*)&pitch, (size_t)(sizeof(float)*width), (size_t)height));
under cudaImage.cu

I can't get my head around this. Any such error in the past....?

cc: error: unrecognized command line option ‘-msse2’

nvidia@tegra-ubuntu:/usr/local/CudaSift-Maxwell/build$ make
[ 14%] Building NVCC (Device) object CMakeFiles/cudasift.dir/cudasift_generated_matching.cu.o
cc: error: unrecognized command line option ‘-msse2’
CMake Error at cudasift_generated_matching.cu.o.cmake:207 (message):
Error generating
/usr/local/CudaSift-Maxwell/build/CMakeFiles/cudasift.dir//./cudasift_generated_matching.cu.o

CMakeFiles/cudasift.dir/build.make:77: recipe for target 'CMakeFiles/cudasift.dir/cudasift_generated_matching.cu.o' failed
make[2]: *** [CMakeFiles/cudasift.dir/cudasift_generated_matching.cu.o] Error 1
CMakeFiles/Makefile2:67: recipe for target 'CMakeFiles/cudasift.dir/all' failed
make[1]: *** [CMakeFiles/cudasift.dir/all] Error 2
Makefile:149: recipe for target 'all' failed
make: *** [all] Error 2

Descriptors convertible to OpenCV?

Am I right to assume that data in SiftPoint contains a descriptor? Can I just take these arrays from all SiftPoints and convert them to an OpenCV descriptor Mat? Are they normalized to 0 to 255? Or do I need the values like scale, sharpness and so on in some way too?

I won't have time to build and test it myself the next few days, so it would be cool to know.

Some proposal for less gpu memory cost

First of all, thank you very much for this excellent work.
As I work on my cuda-affine-sift project based on this, it ran out of memory when the number of SIFT points grew beyond 25600.
I resolved this problem with the following steps in file matching.cu, function MatchSiftPoints2:

I removed the memory copy code (from sift1, sift2 to siftPoints1, siftPoints2), because I think pt1 and pt2 can point directly to sift1[n1].data and sift2[n2].data.
We only need to know the maximum sum score, the index of that maximum and the second-largest sum score for each point in sift1. If we define the grid and block like those for function FindMaxCorr:
dim3 blocksMax(iDivUp(numPts1, 16));
dim3 threadsMax(16, 16);
then these arrays are enough:
__shared__ float maxScore[16 * 16];
__shared__ float maxScor2[16 * 16];
__shared__ int maxIndex[16 * 16];

Yes, as you might guess, I combined the functions MatchSiftPoints2 and FindMaxCorr, and compare the sum values as soon as they are calculated. My code:

__global__ void MatchSiftPoints(float *corrData, SiftPoint *sift1, SiftPoint *sift2, int numPts1, int numPts2)
{
int block_dim = blockDim.x; // blockDim.x == 16
const int tx = threadIdx.x;
const int ty = threadIdx.y;
const int p1 = blockIdx.x * block_dim + ty;
const int idx = ty * 16 + tx;

__shared__ int maxIndex[16 * 16];
maxIndex[idx] = 0;
__syncthreads();

float *corrs = NULL;

if (p1 < numPts1) {
	corrs = &corrData[p1 * block_dim * 2];
	corrs[tx] = 0.0f;
	corrs[tx + 16] = 0.0f;
	const float *pt1 = sift1[p1].data;
	for (int p2 = tx; p2 < numPts2; p2 += 16) {
		float *pt2 = sift2[p2].data;
		float sum = 0.0f;
		for (int i = 0; i < 128; i++) {
			//int itx = (i + tx) & 127; // avoid bank conflicts
			sum += pt1[i] * pt2[i];
		}
		if (sum > corrs[tx]) {
			corrs[tx + 16] = corrs[tx];
			corrs[tx] = sum;
			maxIndex[idx] = p2;
		}
		else if (sum > corrs[tx + 16])
			corrs[tx + 16] = sum;
	}
}
__syncthreads();

//if (p1==1)
//  printf("tx = %d, score = %.2f, scor2 = %.2f, index = %d\n", 
//	   tx, maxScore[idx], maxScor2[idx], maxIndex[idx]);

if (p1 < numPts1) {
	for (int len = 8; len > 0; len /= 2) {
		if (tx < len) {
			float val = corrs[tx + len];
			int i = maxIndex[idx + len];
			if (val > corrs[tx]) {
				corrs[tx + 16] = corrs[tx];
				corrs[tx] = val;
				maxIndex[idx] = i;
			}
			else if (val > corrs[tx + 16])
				corrs[tx + 16] = val;
			float va2 = corrs[tx + 16 + len];
			if (va2 > corrs[tx + 16])
				corrs[tx + 16] = va2;
		}
		__syncthreads();
		//if (p1 == 1 && tx<len)
		//	printf("tx = %d, score = %.2f, scor2 = %.2f, index = %d\n",
		//		tx, corrs[tx], corrs[tx + 16], maxIndex[idx]);
	}
	if (tx == 6)
		sift1[p1].score = corrs[0];
	else if (tx == 7)
		sift1[p1].ambiguity = corrs[16] / (corrs[0] + 1e-6);
	else if (tx == 8)
		sift1[p1].match = maxIndex[ty << 4];
	else if (tx == 9)
		sift1[p1].match_xpos = sift2[maxIndex[ty << 4]].xpos;
	else if (tx == 10)
		sift1[p1].match_ypos = sift2[maxIndex[ty << 4]].ypos;
}
__syncthreads();
//if (p1 == 1 && tx == 0)
//	printf("index = %d/%d, score = %.2f, ambiguity = %.2f, match = %d\n",
//		p1, numPts1, sift1[p1].score, sift1[p1].ambiguity, sift1[p1].match);

}

Code affected in function MatchSiftData:

int block_dim = 16;
float *d_corrData;
int corrSize = numPts1 * block_dim * 2;
safeCall(cudaMalloc((void **)&d_corrData, sizeof(float) * corrSize));
//double allocTime = timer.read();

dim3 blocks(iDivUp(numPts1, block_dim));
dim3 threads(block_dim, block_dim); // each block: 1 points x 16 points
//std::cout << "numPts1=" << numPts1 << ",numPts2=" << numPts2 << ",corrSize=" << corrSize << std::endl;
MatchSiftPoints<<<blocks, threads>>>(d_corrData, sift1, sift2, numPts1, numPts2);
safeCall(cudaThreadSynchronize());
checkMsg("MatchSiftPoints() execution failed\n");

//double matchTime = timer.read();
//printf("MatchSiftPoints time =          %.2f ms\n", (matchTime - allocTime));

safeCall(cudaFree(d_corrData));

`
It works for me: I can match more than 32000 points per image on a GTX 1060 6GB.
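
The sizing arithmetic in the snippet above can be checked on the host. The sketch below is only an illustration: `iDivUp` is reimplemented here as a ceiling division (an assumption about what the repository's helper does), and `LaunchSize` and `MatchLaunchSize` are hypothetical names that do not exist in CudaSift.

```cpp
#include <cstddef>

// Stand-in for the repository's iDivUp helper; assumed to be ceiling division.
inline int iDivUp(int a, int b) { return (a + b - 1) / b; }

// Hypothetical container for the sizes computed in the snippet above.
struct LaunchSize {
  int numBlocks;           // blocks in the 1-D grid
  std::size_t corrFloats;  // number of floats in d_corrData
  std::size_t corrBytes;   // bytes requested from cudaMalloc
};

// Mirrors: corrSize = numPts1 * block_dim * 2; blocks = iDivUp(numPts1, block_dim).
inline LaunchSize MatchLaunchSize(int numPts1, int blockDim = 16) {
  LaunchSize s;
  s.numBlocks = iDivUp(numPts1, blockDim);
  s.corrFloats = static_cast<std::size_t>(numPts1) * blockDim * 2;
  s.corrBytes = s.corrFloats * sizeof(float);
  return s;
}
```

For numPts1 = 32000 this gives 2000 blocks and 1,024,000 floats, which is roughly 4 MB if a float is 4 bytes.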

What does corrData stand for in function MatchSiftPoints2?

Hi,

Function MatchSiftPoints2 contains the core code of the SIFT matching process. Typically, given two sets of SIFT descriptors t1 and t2, for any descriptor in t1 (desc_1, for example) we find the two closest descriptors sift1 and sift2 in t2 under the L2 norm. If the ratio L2(desc_1, sift1)/L2(desc_1, sift2) is smaller than delta, sift1 is accepted as the correspondence of desc_1.

I do not understand the meaning of the following lines in function MatchSiftPoints2.

const float *pt1 = &siftPoints1[ty*128];
const float *pt2 = &siftPoints2[tx*128];
float sum = 0.0f;
for (int i=0;i<128;i++) {
  int itx = (i + tx)&127; // avoid bank conflicts
  sum += pt1[itx]*pt2[itx];
}
if (p1<numPts1)
  corrData[p1*gridDim.y*16 + p2] = (p2<numPts2 ? sum : -1.0f);

Could you explain the principle you used in the SIFT matching process? Many thanks!
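
The matching rule described in this question can be sketched on the CPU. This is a minimal reference under stated assumptions, not the repository's kernel: descriptors are taken to be unit-length vectors of 128 floats, the dot product is used as the similarity score (as in the quoted loop), and the reported ratio is second-best over best, following the `ambiguity` expression quoted elsewhere on this page. For unit-length vectors, ||a - b||^2 = 2 - 2(a·b), so the largest dot product corresponds to the smallest L2 distance. `Descriptor`, `Match`, `Dot` and `MatchOne` are hypothetical names.

```cpp
#include <array>
#include <cstddef>
#include <vector>

constexpr int kDim = 128;
using Descriptor = std::array<float, kDim>;

// Hypothetical result type: best match index plus the two highest scores.
struct Match {
  int index = -1;          // index of the best match in the second set, -1 if none
  float score = -1.0f;     // highest dot product
  float score2 = -1.0f;    // second-highest dot product
  float ambiguity = 0.0f;  // score2 / score; values close to 1 mean ambiguous
};

// Plain dot product over all 128 elements. The kernel's rotated index
// (i + tx) & 127 visits the same elements, only in a different order.
inline float Dot(const Descriptor &a, const Descriptor &b) {
  float sum = 0.0f;
  for (int i = 0; i < kDim; i++) sum += a[i] * b[i];
  return sum;
}

// Finds the best and second-best candidates for one query descriptor.
inline Match MatchOne(const Descriptor &query, const std::vector<Descriptor> &set2) {
  Match m;
  for (std::size_t j = 0; j < set2.size(); j++) {
    float s = Dot(query, set2[j]);
    if (s > m.score) {
      m.score2 = m.score;  // old best becomes second best
      m.score = s;
      m.index = static_cast<int>(j);
    } else if (s > m.score2) {
      m.score2 = s;
    }
  }
  m.ambiguity = m.score2 / (m.score + 1e-6f);
  return m;
}
```

A caller would accept the match only when `ambiguity` is below some threshold of its own choosing, which plays the role of delta in the question.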

Identifiers "totPts" and "fstPts" are undefined

Hi,

I am trying to compile this project on Windows 7 with CMake,
but I get the following error messages:
identifier "totPts" is undefined File: cudaSiftH.cu Line 226
identifier "fstPts" is undefined File: cudaSiftH.cu Line 229

Any help is appreciated :)

What is the difference between CudaSift and Lowe's SIFT?

Hello. I have a question about the difference between the 128-dimensional descriptors of CudaSift and Lowe's SIFT. For a SIFT feature at the same position, is the ordering of the 128 descriptor elements the same? If not, what is the correspondence? Thank you very much.

The ordering of the descriptor elements

Hello, what is the difference between the 128-dimensional descriptor of CudaSift and that of Lowe's SIFT? For the same position in the same image, are the descriptors the same? If not, what is the difference? Thank you very much.

Possibly unnecessary use of __syncthreads in kernel function FindMaxCorr in matching.cu

Hello there!

I am currently studying the CudaSift source code and may have found some unnecessary uses of __syncthreads() in some kernel functions in matching.cu.

For kernel function FindMaxCorr,

__global__ void FindMaxCorr(float *corrData, SiftPoint *sift1, SiftPoint *sift2, int numPts1, int corrWidth, int siftSize)
{
  .........
  if (tx==6)
    sift1[p1].score = maxScore[ty*16];
  if (tx==7)
    sift1[p1].ambiguity = maxScor2[ty*16] / (maxScore[ty*16] + 1e-6);
  if (tx==8)
    sift1[p1].match = maxIndex[ty*16];
  if (tx==9)
    sift1[p1].match_xpos = sift2[maxIndex[ty*16]].xpos;
  if (tx==10)
    sift1[p1].match_ypos = sift2[maxIndex[ty*16]].ypos;
  __syncthreads();
}

At line 160, just before the kernel function returns, FindMaxCorr calls __syncthreads(). What confuses me is that line 160 is the last statement the kernel executes, so synchronizing the threads there should be unnecessary, shouldn't it?

The same issue applies to FindMaxCorr1, FindMaxCorr2 and FindMaxCorr3.

Thanks very much! :)

Pascal cards

Has anyone tried the code on other Pascal cards? I would really like to see the performance on an NVidia Titan X. Unfortunately, I haven't yet been able to order one.

Wrong Sift points coordinates

Hi, thanks a lot for your great work.

I am having an issue with the code.
Whenever I read the SIFT points after extracting them from the image, their x and y coordinates aren't where they should be.

I am simply drawing these points over the image with OpenCV and displaying them in real time. The SIFT points seem to trace the shapes of objects in the image, but at a larger scale and with a large offset.

I tried multiplying and dividing the x and y coordinates by the scale attribute of the corresponding SIFT point, but it didn't work.

Any idea how this might be solved?

Note: I am running the code on a GT 750M (Kepler with compute capability 3.0, 4 GB DDR3, 12 ms to run).

Licensing and attribution

Sorry to bother you again so soon, but someone reached out to us asking about commercially licensing our fork of CudaSift (link here). They are aware of David Lowe's original patent, but I wanted to a) properly attribute the work you did and b) protect both of us from liability. Do you have thoughts on how you would license CudaSift?

I was thinking of using the MIT License or even the Unlicense, but wanted to check with you first. Let me know!

Matching.cu

Hi,

In matching.cu, to my understanding, the following kernel code in function FindMaxCorr performs the reduction that finds the best two matches. However, I am somewhat confused about the outcome. Since all threads with tx < 8 in a block run concurrently, how can we guarantee that the best match eventually ends up in the first entry? I would appreciate it a lot if you could explain this part a little.
  __syncthreads();
  for (int len=8;len>0;len/=2) {
    if (tx<8) {
      float val = maxScore[idx+len];
      int i = maxIndex[idx+len];
      if (val>maxScore[idx]) {
        maxScor2[idx] = maxScore[idx];
        maxScore[idx] = val;
        maxIndex[idx] = i;
      } else if (val>maxScor2[idx])
        maxScor2[idx] = val;
      float va2 = maxScor2[idx+len];
      if (va2>maxScor2[idx])
        maxScor2[idx] = va2;
    }
    __syncthreads();
  }

I tried two images of my own and found that all matches have a very high ambiguity.
Should the following be used instead?

  sift1[p1].ambiguity = (1 - maxScore[ty * 16]) / (1 - maxScor2[ty * 16] + 1e-6);

Thanks!
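
To see how a tree reduction of this kind leaves the best match in slot 0, it can help to run the same merge sequentially. The sketch below is a CPU simulation, not the kernel itself. It assumes that each pass is separated by a barrier (the role of __syncthreads()) and that only lanes with tx < len merge in the pass with stride len, which is the `tx < len` variant that appears in another snippet on this page rather than the `tx < 8` condition quoted in this question. `Top2`, `Merge` and `ReduceTop2` are hypothetical names.

```cpp
#include <cassert>
#include <vector>

// Hypothetical per-lane state: best score, second-best score, index of the best.
struct Top2 {
  float score;
  float score2;
  int index;
};

// Merges lane b into lane a, keeping the two largest scores seen by either.
inline void Merge(Top2 &a, const Top2 &b) {
  if (b.score > a.score) {
    a.score2 = a.score;  // old best of a becomes its second best
    a.score = b.score;
    a.index = b.index;
  } else if (b.score > a.score2) {
    a.score2 = b.score;
  }
  if (b.score2 > a.score2) a.score2 = b.score2;
}

// Sequential simulation of the 16-lane reduction. In the pass with stride
// len, lane tx (< len) merges lane tx + len into itself. Each lane writes
// only its own slot and reads only slots >= len, so the order in which the
// lanes run inside one pass does not change the result.
inline Top2 ReduceTop2(std::vector<Top2> lanes) {
  assert(lanes.size() == 16);
  for (int len = 8; len > 0; len /= 2)
    for (int tx = 0; tx < len; tx++)
      Merge(lanes[tx], lanes[tx + len]);
  return lanes[0];
}
```

After the passes with strides 8, 4, 2 and 1, every lane has been folded into lane 0 exactly once, so lane 0 holds the two largest scores of all 16 lanes.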

Running CudaSift on Ubuntu fails with "LowPass() execution failed"

I cloned the latest code (Sep 11, 2018) to run on Ubuntu 16.04, and it failed with the following error:

Image size = (0,0)
Initializing data...
Device Number: 0
  Device name: GeForce GTX 1050 Ti
  Memory Clock Rate (MHz): 3504
  Memory Bus Width (bits): 128
  Peak Memory Bandwidth (GB/s): 112.1

Failed to allocate device data
Failed to allocate device data
Failed to allocate device data
checkMsg() CUDA error: LowPass() execution failed
 in file </home/aqrose/siqin_all/siqin_project/CudaSift/cudaSiftH.cu>, line 416 : invalid configuration argument.

Could anyone tell me how to handle this problem?

Dense SIFT?

Is there an implementation for dense SIFT?

What is the LowPass (host) function doing? I didn't find it in the original paper

Hi, first of all, thank you for providing the CUDA version of SIFT; it is very helpful for teaching myself computer vision. I tried to follow the code all the way through but got stuck somewhere in the middle. I couldn't figure out what the LowPass function (the host one) does before the SIFT descriptors are extracted, or what the global variables beginning with "LOWPASS_" are. I tried to find an explanation in the paper but failed (or maybe I didn't read it carefully enough). May I ask for some guidance on that? Thank you. Since this is not a bug, I will close it as soon as it is answered.

Build error: “cmd.exe” exited with code 1

Hi, Celebrandil
I want to build this project on my PC under the following conditions:
Branch: Maxwell
IDE: vs2015
cuda version: v 8.0
GPU: NVidia Titan X
but I got the following error in vs2015:

Building NVCC (Device) object CMakeFiles/cudasift.dir/Release/cudasift_generated_cudaImage.cu.obj
1> nvcc fatal : redefinition of argument 'gpu-architecture'
1> CMake Error at cudasift_generated_cudaImage.cu.obj.Release.cmake:222 (message):
1> Error generating
F:/MillerWorkPath/VSProject/CudaSift/Bin/CMakeFiles/cudasift.dir//Release/cudasift_generated_cudaImage.cu.obj
and the error message MSB6006: “cmd.exe” exit with code 1.
I am not familiar with GPU programming. Can you give me some tips on how to fix this problem? Thanks.
Miller

Matching SIFT points of images with different sizes

Hello, and thanks for your great code.
When I run the project to detect points in two images of different sizes and match them, the project crashes.
What should I do?

Thanks.

Hello, could you add some comments to the code?

Recently I have been trying to optimize my SIFT feature code with CUDA, and I am very pleased that you shared this code. However, I have some trouble reading it because it contains no explanations. Could you add some comments to the important parts of the code? Thanks very much. A stranger from across the ocean.

How do you copy the SiftData structure?

Dear author,

First of all, thanks for developing such nice software.

I have a problem that you might be able to help with.
I am trying to run SIFT extraction on many images and match them in a different function/scope. For example, I want to extract SIFT features from 100 images and later match image 53 against image 96.
To do that, I have to keep the SiftData structures somewhere else in memory. So the question is: how do you copy a SiftData structure to another variable?
I can copy "h_data" with memcpy, but I don't know what to do with "d_data". This pointer seems to point somewhere in GPU device memory. Ignoring "d_data" gives me an error when I call MatchSiftData().
Correct me if I am wrong, since I am not a GPU person.

Could you give me some advice?

Setting parameters similar to OpenCV

Hello sir.
I want to use the OpenCV library on the CPU with cv::sift and these parameters: nFeatures = 40000, octaves = 3,
contrast_threshold = 0.01, edge_threshold = 10, sigma = 1.6.

I cannot find edge_threshold in your project.

How should I set the CudaSift parameters to get results similar to the above?