Thank you everybody!
So, to summarize my initial 3 points:
- As soon as I provide comprehensive evidence that kmeans++ improves upon random init on datasets with >1M points and 10k centroids, the maintainers are glad to accept the contribution. I doubt that I can provide this evidence for 10k centroids, as my own experience was based on 8M samples (OK) and 1-2k centroids (not OK). So I revoke this.
- Cosine distance can be calculated without taking the arccos and without losing any precision, so the initial (2) is revoked. The alternative proposal of L1 is interesting to the maintainers, but it will take much work. I am keeping it in the hope that I find the time to experiment with it (after all, I almost finished the implementation in kmcuda). I need time to read the fancy material about Nyström-like methods / KPCA to understand it properly.
- Python 3 support was added as a result of the discussion - closed.
Thus I am renaming the topic so it better reflects the intent.
from faiss.
Hi!
Thanks for your interest in Faiss. Could you consider the following points:
- The k-means can be initialized with arbitrary centroids when they are set on input in the Clustering object. Also, in our experiments with kmeans++ initialization, we have not found it to be more useful than random init in large dimension with many points. What are the tasks you think it would be useful for?
- There is a "spherical k-means" option in the Clustering object that normalizes the centroids at each iteration. Is this what you are referring to?
- Python 3 is already supported (although the examples are written in Python 2).
- Sure, they can be imported. kmeans++ takes some time to finish and can be properly accelerated. We use k-means for dimensionality reduction with a large number of clusters, and kmeans++ shows substantially faster convergence (up to 30%) on our data.
- Kind of, but it is not enough to normalize the vectors. The problem is that the angular distance works better (again, in our experiments). The naked scalar product (that is, cos \alpha) is not a distance metric by definition: it does not satisfy the triangle inequality. https://en.wikipedia.org/wiki/Cosine_similarity
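A quick numpy check of that claim (illustrative only; the helper names are mine): with three unit vectors at 0°, 60° and 120°, 1 − cos violates the triangle inequality, while the angular distance (arccos) satisfies it.

```python
import numpy as np

def unit(theta):
    # 2-D unit vector at angle theta (radians)
    return np.array([np.cos(theta), np.sin(theta)])

# three points on the unit circle at 0, 60 and 120 degrees
a, b, c = unit(0.0), unit(np.pi / 3), unit(2 * np.pi / 3)

def cos_dist(x, y):
    # 1 - cosine similarity -- NOT a metric
    return 1.0 - float(x @ y)

def ang_dist(x, y):
    # angular distance (arc length on the unit circle) -- a true metric
    return float(np.arccos(np.clip(x @ y, -1.0, 1.0)))

# triangle inequality d(a, c) <= d(a, b) + d(b, c):
#   cosine distance:  ~1.5 > ~1.0            -> violated
#   angular distance: 2*pi/3 <= pi/3 + pi/3  -> holds (with equality)
```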
- Swig generates Python 2 native *.so files by default, and in that case I still have to edit the makefile manually. At the very least there should be a py3 rule. I do understand that Facebook production is based on Python 2, but the rest of the folks have already switched. E.g. google/grumpy#1
_swigfaiss.so: undefined symbol: PyCObject_Type
and friends.
Regarding 1.: Matthijs and I are the authors of the Yael library (http://yael.gforge.inria.fr/, CPU library), which included a k-means++ initialization. Our observation was that, on all our experimental data, the initialization was actually costly and not BLAS3-friendly, even though we put some effort into its parallelization. As Matthijs mentioned, the gain from k-means++ was not worth this cost in our experiments (the gain was very small). For a fixed amount of time, we found that it was better (w.r.t. loss) to add a few iterations to the regular k-means, which is why random selection was the default choice in our Matlab interface.
But I understand that this does not have to be generally true, especially if you look at the data at convergence (which we normally don't do considering our typical parameters, i.e., many centroids and input points).
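For reference, the seeding under discussion can be sketched in a few lines of numpy (a sketch, not the Yael or Faiss implementation; the function name is mine). The loop makes the cost visible: each of the k steps does a data-dependent pass over all points, which is hard to express as BLAS3.

```python
import numpy as np

def kmeanspp_init(x, k, seed=0):
    """k-means++ seeding (D^2 sampling): each new centroid is drawn with
    probability proportional to the squared distance from each point to
    its closest already-chosen centroid."""
    rng = np.random.default_rng(seed)
    n, d = x.shape
    centroids = np.empty((k, d), dtype=x.dtype)
    centroids[0] = x[rng.integers(n)]
    # squared distance of every point to its nearest chosen centroid
    d2 = ((x - centroids[0]) ** 2).sum(axis=1)
    for i in range(1, k):
        centroids[i] = x[rng.choice(n, p=d2 / d2.sum())]
        d2 = np.minimum(d2, ((x - centroids[i]) ** 2).sum(axis=1))
    return centroids
```

Note that the k sampling steps are strictly sequential: each draw depends on the centroids already chosen, which is what prevents a GEMM-style formulation.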
Regarding 2.: as mentioned by Matthijs, Faiss includes a spherical k-means, which aims to maximize \sum_i <q(x_i) | x_i>
- Assignment step: on the sphere, cos = inner product. I assume we agree that taking the arccos has no effect on the assignment of points to centroids, because this function is monotonic: the closest centroid w.r.t. L2, the maximum cosine similarity, and the maximum inner product are all equivalent on the sphere (with both the input vectors and the centroids on the sphere).
- Update step: take the centroid that minimizes the L2 loss, i.e., the mean of all vectors assigned to the centroid, and project it back onto the sphere.
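The two steps above can be sketched in numpy like this (illustrative only, not the Faiss implementation; the function name is mine):

```python
import numpy as np

def spherical_kmeans_step(x, centroids):
    """One iteration of the spherical k-means described above.
    x and centroids are assumed L2-normalized row-wise."""
    # assignment: on the sphere, max inner product == max cosine == min L2
    assign = (x @ centroids.T).argmax(axis=1)
    new = np.zeros_like(centroids)
    for j in range(centroids.shape[0]):
        members = x[assign == j]
        if len(members) == 0:
            new[j] = centroids[j]  # keep an empty cluster unchanged
            continue
        m = members.mean(axis=0)   # L2-optimal centroid of the cluster
        norm = np.linalg.norm(m)
        # project the mean back onto the unit sphere
        new[j] = m / norm if norm > 0 else centroids[j]
    return assign, new
```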
So, this is the algorithm described (Figure 1) in
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.111.8125&rep=rep1&type=pdf
We are aware that there are some variants of this algorithm, including one I co-authored that modifies the cost because the objective function is different (see Section 6.5 in https://arxiv.org/pdf/1412.3328.pdf). To which algorithm do you refer?
Thanks for your feedback!
Hi @vmarkovtsev,
Indeed the port to Python 3 was not complete. I added instructions to the INSTALL and makefile.inc.Linux on how to compile for py3. Thanks for pointing this out.
There are three possible levels of integration of other k-means algorithms:
- re-use the Clustering object and input another initialization
- re-use the Index object and the km_update_centroids function that does the M-step
- re-use only the Index object for fast assignment.
At which level do you think your changes to the clustering could be integrated?
Thanks
- I don't know what to say... As I said, we observed substantial (if subjective) improvements in achieved quality and convergence. I guess that if k-means iterations are blazingly fast, it can indeed be more profitable to simply add more iterations instead of improving the initialization to reach the same metrics. Our implementation is slower, so it made sense to run kmeans++. Do you strictly target only very large, uniformly distributed datasets? I could prepare a notebook to show what I mean at some point.
- Correct, I am referring to exactly the described algorithm. I hadn't read this paper, but it is implemented word for word in our library, except that we take the arccos. Indeed, since arccos is monotonic, we can swap arccos(a \cdot b) for -(a \cdot b).
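A one-line numpy check of that swap (illustrative): because arccos is monotonically decreasing on [-1, 1], the centroid at minimum angular distance is exactly the one at maximum dot product.

```python
import numpy as np

rng = np.random.default_rng(0)
# one unit query vector and a few unit candidate centroids
q = rng.normal(size=8)
q /= np.linalg.norm(q)
cands = rng.normal(size=(5, 8))
cands /= np.linalg.norm(cands, axis=1, keepdims=True)

dots = cands @ q
# arccos is monotonically decreasing, so argmin of the angle
# equals argmax of the inner product
best_by_angle = np.argmin(np.arccos(np.clip(dots, -1.0, 1.0)))
best_by_dot = np.argmax(dots)
```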
@mdouze,
I forgot that besides L2 and angular I've also got L1, though it is not quite production-ready. As far as I understand, L1 has nothing to do with inner products. I looked through the code and got the impression that L1 and L2 mean something else there (1-D and 2-D input). Am I right?
I guess my proposed additions in (1) fall under level (1).
@vmarkovtsev
We do not target uniformly distributed datasets but real-life features (large ones). At the time we tested k-means++, we had mostly experimented with various features extracted from images (SIFT, GIST, etc.), such as those provided in the following benchmarks (widely used in approximate-search indexing benchmarking):
http://corpus-texmex.irisa.fr/
If you have done some experiments with k-means++ vs random init at large scale (we are not much interested in datasets comprising fewer than 1 million input vectors and fewer than 10k centroids), I am of course curious about the results.
One technical point: one of the strengths of k-means++ is to avoid random clusters in the first iterations. Random init is bad by default for spiky distributions, but this is easily compensated for by detecting the case and employing a smart (and costless) splitting strategy, which we do in Faiss. I assume that in your comparison between k-means++ and random init you carefully handled this point; otherwise, IMO, the comparison would not be conclusive.
I see, thanks. Our datasets are somewhat NLP-ish; see https://data.world/vmarkovtsev for examples.
Any implementation on the GPU aiming for pairwise L1 distance will likely be a lot slower, since there would need to be an equivalent of the GEMM kernel that calculates d_mn = sum_k |x_mk - y_kn| instead of d_mn = sum_k (x_mk * y_kn). The GEMM strategy is efficient because there is high memory reuse in the register file and shared memory via tiling.
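A tiny CPU-side numpy illustration of the two reductions (just to show the shapes; variable names are mine): the inner-product case is a single matrix multiply, while the L1 case needs an abs-difference per term, here materialized naively via broadcasting, with none of GEMM's memory reuse.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16))   # m vectors of dimension k, as rows
y = rng.normal(size=(16, 3))   # k x n: n vectors of dimension k, as columns

# inner-product case: one GEMM call, d_mn = sum_k x_mk * y_kn
d_ip = x @ y

# L1 case: the per-term op is |x_mk - y_kn|, not a multiply, so there is
# no GEMM to call; broadcasting builds an (m, k, n) intermediate, unlike
# the tiled register/shared-memory reuse of a real GEMM kernel
d_l1 = np.abs(x[:, :, None] - y[None, :, :]).sum(axis=1)
```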
However, cuBLAS GEMM itself is quite inefficient for small k (k being the dimension of the vectors). Full arithmetic throughput is only seen for really large k. Small k is also the regime where GEMM starts to be memory bandwidth / latency bound rather than arithmetically bound.
A kernel could probably be written for L1 using a GEMM-like strategy, but it would take a lot of time to get right or tune appropriately for different architectures.
@vmarkovtsev: L1 is not derivable from an inner product.
However, many other kernels, such as the intersection kernel (often a good replacement for L1), can be approximated with explicit embeddings, a strategy that is increasingly popular and that you may be aware of. This can be done with Nyström-like methods / KPCA. Personally, I am a big fan of the ones proposed by Andrea Vedaldi:
https://www.robots.ox.ac.uk/~vedaldi/assets/pubs/vedaldi11efficient.pdf
Employing such embeddings, you can map a vector x -> phi(x) such that the inner product phi(x)^T phi(y) approximates your kernel k(x,y). Then you can simply search/cluster based on the vectors phi(x), for instance in a search application such as the one proposed in
https://hal.inria.fr/hal-00722635v2/document
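As an illustration of the idea, here is a generic Nyström-style sketch in numpy for the intersection kernel mentioned above (this is not Vedaldi's homogeneous-kernel-map method; the function names and the landmark choice are my own illustrative assumptions): with landmarks z, phi(x) = K_xz K_zz^{-1/2}, so that phi(x) phi(y)^T ≈ k(x, y).

```python
import numpy as np

def intersection_kernel(a, b):
    # k(x, y) = sum_k min(x_k, y_k), for nonnegative vectors
    return np.minimum(a[:, None, :], b[None, :, :]).sum(axis=2)

def nystrom_embedding(x, landmarks, eps=1e-10):
    """Nystrom-style explicit feature map: phi(x) phi(y)^T approximates
    k(x, y) using a small set of landmark points."""
    k_zz = intersection_kernel(landmarks, landmarks)
    k_xz = intersection_kernel(x, landmarks)
    # K_zz^{-1/2} via eigendecomposition (K_zz is symmetric PSD)
    w, v = np.linalg.eigh(k_zz)
    w = np.maximum(w, eps)  # guard against tiny/negative eigenvalues
    return k_xz @ (v / np.sqrt(w)) @ v.T

rng = np.random.default_rng(0)
data = rng.random((200, 10))  # nonnegative input vectors
z = data[rng.choice(200, 32, replace=False)]  # 32 landmark points
phi = nystrom_embedding(data, z)
# phi @ phi.T now approximates the intersection kernel Gram matrix, so
# plain inner-product search / clustering can be run on phi instead
```

A useful sanity check of the construction: on the landmarks themselves the approximation is exact, i.e. phi(z) phi(z)^T = K_zz.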
Duplicate of #848
Merging into a more general task; we might support some of this on GPU over the next couple of months.