
Comments (11)

vmarkovtsev commented on May 28, 2024

Thank you everybody!

So, to summarize my initial 3 points.

  1. Once I provide comprehensive evidence that kmeans++ improves upon random init on datasets with >1M points and 10k centroids, the maintainers are happy to accept the contribution. I doubt I can provide this evidence for 10k centroids, since my own experience is based on 8M samples (OK) but only 1-2k centroids (not OK). So I withdraw this point.

  2. Cosine distance can be handled without taking the arccos and without losing any precision, so the original point (2) is withdrawn. The alternative proposal of L1 distance is interesting to the maintainers, but it will take a lot of work. I am keeping it in the hope that I find the time to experiment with it (after all, I almost finished the implementation in kmcuda). I need time to read up on the Nystrom-like methods / KPCA to understand them properly.

  3. Python 3 support was added as a result of the discussion - closed.

Thus I am renaming the topic to better reflect the intent.


mdouze commented on May 28, 2024

Hi!

Thanks for your interest in Faiss. Could you consider the following points:

  1. The k-means can be initialized with arbitrary centroids by setting them on input in the Clustering object (see the sketch after this list for how that can look from Python). Also, in our experiments with kmeans++ initialization, we have not found it to be more useful than random initialization in high dimension and with many points. What are the tasks you think it would be useful for?

  2. There is a "spherical k-means" option in the Clustering object that normalizes the centroids at each iteration. Is this what you are referring to?

  3. Python 3 is already supported (although the examples are written in Python 2).
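
For what it's worth, feeding externally computed seeds into the Clustering object could look roughly like this from Python. This is only a sketch: the exact binding helpers (`copy_array_to_vector`, `vector_float_to_array`) and the `train` signature vary across Faiss versions.

```python
import numpy as np
import faiss

d, k = 64, 256
x = np.random.rand(100_000, d).astype('float32')      # training vectors

# Seeds computed by any external method; a random subsample stands in for
# a real k-means++ initialization here.
init = x[np.random.choice(len(x), k, replace=False)]

clus = faiss.Clustering(d, k)
clus.niter = 20
# Pre-filling clus.centroids makes the Clustering object skip its own
# random initialization and start from the provided seeds.
faiss.copy_array_to_vector(init.ravel(), clus.centroids)

index = faiss.IndexFlatL2(d)       # used for the assignment step
clus.train(x, index)

centroids = faiss.vector_float_to_array(clus.centroids).reshape(k, d)
```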


vmarkovtsev commented on May 28, 2024
  1. Sure, they can be imported. kmeans++ takes some time to finish and can be accelerated considerably. We are using k-means for dimensionality reduction with a large number of clusters, and kmeans++ shows substantially (up to 30%) faster convergence on our data.

  2. Kind of, but it is not enough to normalize the vectors. The point is that the angular distance works better (again, in our experiments). The bare scalar product (that is, cos \alpha) is not a distance metric by definition: it does not satisfy the triangle inequality (a quick numeric check follows this list). https://en.wikipedia.org/wiki/Cosine_similarity

  3. SWIG generates a Python 2 native *.so by default, and in that case I still have to edit the makefile manually; at the very least there should be a py3 rule. I do understand that Facebook production is based on Python 2, but the rest of the folks have already switched, e.g. google/grumpy#1. Otherwise you get
    _swigfaiss.so: undefined symbol: PyCObject_Type and friends.
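
To illustrate the triangle-inequality point with a quick check (plain NumPy, not tied to either library):

```python
import numpy as np

def cos_dist(a, b):          # 1 - cosine similarity (not a metric)
    return 1.0 - a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def ang_dist(a, b):          # arccos of cosine similarity (a proper metric)
    c = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return np.arccos(np.clip(c, -1.0, 1.0))

a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0]) / np.sqrt(2)
c = np.array([0.0, 1.0])

# Triangle inequality d(a, c) <= d(a, b) + d(b, c):
print(cos_dist(a, c), cos_dist(a, b) + cos_dist(b, c))  # 1.0 > ~0.586 -> violated
print(ang_dist(a, c), ang_dist(a, b) + ang_dist(b, c))  # pi/2 == pi/2 -> holds
```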


jegou commented on May 28, 2024

Regarding 1.: Matthijs and I are the authors of the Yael library (http://yael.gforge.inria.fr/, CPU library), which included a k-means++ initialization. Our observation was that, on all our experimental data, the initialization was quite costly and not BLAS3-friendly, even though we put some effort into parallelizing it. As Matthijs mentioned, the gain from k-means++ (very small in our experiments) was not worth this cost. For a fixed amount of time, we found it better (w.r.t. loss) to add a few iterations to the regular k-means, which is why random selection was the default choice in our Matlab interface.
But I understand that this does not have to be generally true, especially if you look at the data at convergence (which we normally don't do considering our typical parameters, i.e., many centroids and input points).
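
To make the BLAS3 point concrete, here is a textbook D^2-sampling (k-means++) seeding in plain NumPy (a sketch, not Yael's or Faiss's code): each seed depends on all previously chosen seeds, so the work is inherently sequential and cannot be batched into one large matrix multiplication.

```python
import numpy as np

def kmeanspp_init(x, k, rng=np.random.default_rng(0)):
    """Pick each new seed with probability proportional to its squared
    distance to the closest seed chosen so far (D^2 sampling)."""
    n = len(x)
    centroids = [x[rng.integers(n)]]
    d2 = np.full(n, np.inf)
    for _ in range(k - 1):
        # Only a matrix-vector-sized distance update per step, not one big GEMM.
        d2 = np.minimum(d2, ((x - centroids[-1]) ** 2).sum(axis=1))
        p = d2 / d2.sum()
        centroids.append(x[rng.choice(n, p=p)])
    return np.stack(centroids)
```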

Regarding 2.: as mentioned by Matthijs, Faiss includes a spherical k-means that aims to maximize \sum_i <q(x_i) | x_i>

  1. Assignment step: on the sphere, cosine = inner product. I assume we agree that taking the arccos has no effect on the assignment of points to centroids, because this function is monotonic: the closest centroid w.r.t. L2, the maximum cosine similarity, and the maximum inner product are all equivalent on the sphere (for unit vectors, ||x - c||^2 = 2 - 2 <x | c>), with both the input vectors and the centroids on the sphere.
  2. Update step: take the centroid that minimizes the L2 loss, i.e., the mean of all vectors assigned to the centroid, and project it back onto the sphere.

So, this is the algorithm described (Figure 1) in
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.111.8125&rep=rep1&type=pdf
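
In pseudo-NumPy, one iteration of that procedure looks like the following (a sketch of the algorithm as described above, not Faiss's implementation):

```python
import numpy as np

def spherical_kmeans_iter(x, centroids):
    """x and centroids are assumed L2-normalized, shapes (n, d) and (k, d)."""
    # Assignment: on the sphere ||x - c||^2 = 2 - 2 <x, c>, so nearest in L2,
    # max cosine and max inner product all pick the same centroid.
    assign = (x @ centroids.T).argmax(axis=1)
    # Update: mean of the assigned vectors, projected back onto the sphere.
    for j in range(len(centroids)):
        members = x[assign == j]
        if len(members):
            m = members.mean(axis=0)
            centroids[j] = m / np.linalg.norm(m)
    return centroids, assign
```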

We are aware that there are some variants of this algorithm, including one I co-authored that modifies the cost because the objective function is different (see Section 6.5 in https://arxiv.org/pdf/1412.3328.pdf). To which algorithm do you refer?

Thanks for your feedback!


mdouze commented on May 28, 2024

Hi @vmarkovtsev,

Indeed the port to Python 3 was not complete. I added instructions to the INSTALL and makefile.inc.Linux on how to compile for py3. Thanks for pointing this out.

There are three possible levels of integration of other kmeans algorithms:

  1. re-use the Clustering and input another initialization
  2. re-use the Index object and the km_update_centroids function that does the M-step
  3. re-use only the Index object for fast assignment (illustrated in the sketch below).
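
For example, level 3 might look roughly like this (a sketch only, with everything but the assignment step done in NumPy; not Faiss's internal code):

```python
import numpy as np
import faiss

def one_kmeans_iteration(x, centroids):
    """x: (n, d) float32, centroids: (k, d) float32."""
    d = x.shape[1]
    index = faiss.IndexFlatL2(d)
    index.add(centroids)                    # centroids as the searchable set
    _, assign = index.search(x, 1)          # fast assignment step via Faiss
    assign = assign.ravel()
    # Custom M-step; an L1 median or a spherical projection could go here instead.
    new_centroids = np.stack([
        x[assign == j].mean(axis=0) if np.any(assign == j) else centroids[j]
        for j in range(len(centroids))
    ])
    return new_centroids.astype('float32')
```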

At which level do you think your changes to the clustering could be integrated?

Thanks


vmarkovtsev commented on May 28, 2024

@jegou,

  1. I don't know what to say... As I said, we observed substantial (if subjective) improvements in achieved quality and convergence. I guess that if the k-means iterations are blazingly fast, it can indeed be more profitable to simply add more iterations instead of improving the initialization to reach the same metrics. Our implementation is slower, so it made sense to run kmeans++. Do you strictly target only very large, uniformly distributed datasets? I could prepare a notebook to show what I mean at some point.

  2. Correct, I am referring to exactly the described algorithm. I hadn't read this paper, but it is implemented word for word in our library, except that we take the arccos. Indeed, since arccos is monotonic, we can swap arccos(a \cdot b) for -(a \cdot b).

@mdouze,
I forgot that I also have L1 besides L2 and angular, though it is not quite production-ready. As far as I understand, L1 has nothing to do with inner products. I looked through the code and got the impression that L1 and L2 mean something else there (1-D and 2-D input). Am I right?

I guess my proposed additions from my point (1) fall under your level (1).


jegou commented on May 28, 2024

@vmarkovtsev
We do not target uniformly distributed datasets but real-life features (large ones). At the time we tested k-means++, we had mostly experimented with various features extracted from images (SIFT, GIST, etc.), such as those provided in the following benchmarks (widely used for approximate-search benchmarking):
http://corpus-texmex.irisa.fr/

If you have done experiments with k-means++ vs random init at large scale (we are not much interested in datasets comprising fewer than 1 million input vectors and fewer than 10k centroids), I am of course curious about the results.
One technical point: one of the strengths of k-means++ is that it avoids poorly placed clusters in the first iterations. Random init is bad by default for spiky distributions, but this is easily compensated for by detecting the case and employing a smart (and essentially free) splitting strategy, which we do in Faiss (sketched below). I assume that, in the comparison between k-means++ and random init, you carefully handle this point; otherwise, IMO, the comparison would not be conclusive.
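
For concreteness, this kind of splitting strategy can be sketched as follows (schematic only, not the actual Faiss code, and the perturbation scale eps is an illustrative choice): after each assignment step, an empty cluster takes over a slightly perturbed copy of the largest cluster's centroid.

```python
import numpy as np

def fix_empty_clusters(centroids, assign, eps=1e-5, rng=np.random.default_rng(0)):
    k = len(centroids)
    sizes = np.bincount(assign, minlength=k)
    for j in np.where(sizes == 0)[0]:
        big = sizes.argmax()
        noise = eps * rng.standard_normal(centroids.shape[1])
        centroids[j] = centroids[big] + noise   # split the largest cluster in two
        centroids[big] -= noise
        sizes[j] = sizes[big] // 2
        sizes[big] -= sizes[j]
    return centroids
```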


vmarkovtsev commented on May 28, 2024

I see, thanks. Our datasets are somewhat NLP-ish; https://data.world/vmarkovtsev has some examples.


wickedfoo commented on May 28, 2024

Any implementation on the GPU aiming for pairwise L1 distance will likely be a lot slower, since there would need to be an equivalent to the GEMM kernel that calculates d_mn = sum_k |x_mk - y_kn| instead of d_mn = sum_k (x_mk * y_kn). The GEMM strategy is efficient because there is high memory reuse in the register file and shared memory via tiling.
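
As a reference for what such a kernel would compute (a naive NumPy version, with none of the tiling that makes GEMM fast):

```python
import numpy as np

def pairwise_l1(x, y):
    # d[m, n] = sum_k |x[m, k] - y[n, k]|
    # Materializes an (M, N, K) intermediate: O(M*N*K) work and memory traffic,
    # with no register/shared-memory tile reuse the way a GEMM kernel has.
    return np.abs(x[:, None, :] - y[None, :, :]).sum(axis=-1)
```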

However, cuBLAS GEMM itself is quite inefficient for small k (k being the dimension of the vectors). Full arithmetic throughput is only reached for quite large k. Small k is also the regime where GEMM starts to be memory-bandwidth / latency bound rather than compute bound.

A kernel could probably be written for L1 using a GEMM-like strategy, but it would take a lot of time to get right or tune appropriately for different architectures.


jegou commented on May 28, 2024

@vmarkovtsev: L1 is not derivable from an inner product.
However, many other kernels, such as the intersection kernel (often a good replacement for L1), can be approximated with explicit embeddings, a strategy that is increasingly popular and that you may be aware of. This can be done with Nystrom-like methods / KPCA. Personally, I am a big fan of the ones proposed by Andrea Vedaldi:
https://www.robots.ox.ac.uk/~vedaldi/assets/pubs/vedaldi11efficient.pdf

Employing such embeddings, you can map a vector x->phi(x), such that the inner product phi(x)^T phi(y) approximates your kernel k(x,y). Then you can simply search/cluster based on the vectors phi(x), for instance for a search application such as proposed in
https://hal.inria.fr/hal-00722635v2/document
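
As an illustration of the Nystrom route (a minimal sketch with hypothetical helper names, using the intersection kernel as the example; Vedaldi's explicit feature maps are closed-form and computed differently):

```python
import numpy as np

def intersection_kernel(a, b):
    # k(x, y) = sum_i min(x_i, y_i), for non-negative features
    return np.minimum(a[:, None, :], b[None, :, :]).sum(axis=-1)

def nystrom_embedding(x, landmarks, dim):
    """Map x -> phi(x) so that phi(x) @ phi(y) approximates k(x, y)."""
    W = intersection_kernel(landmarks, landmarks)        # (m, m)
    C = intersection_kernel(x, landmarks)                # (n, m)
    vals, vecs = np.linalg.eigh(W)
    vals, vecs = vals[::-1][:dim], vecs[:, ::-1][:, :dim]   # top eigenpairs
    return C @ vecs / np.sqrt(np.maximum(vals, 1e-12))   # (n, dim)
```

The embedded vectors phi(x) can then be searched or clustered with plain inner-product (or L2) machinery.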


wickedfoo commented on May 28, 2024

Duplicate of #848

Merging into a more general task; we might support some of this on the GPU over the next couple of months.

