lmcinnes commented on August 27, 2024

Thanks for this. Challenging datasets are always interesting, and I will have to explore exactly what makes nearest neighbors so hard with this data. It does seem remarkable that it performs quite so poorly, and the problem is clearly not restricted to RP-trees and NN-Descent, since Annoy seems to have similar issues.

atarashansky commented on August 27, 2024

Typically with scRNA-seq datasets, there can be many points that are similarly close to a particular cell. For example, if you have a homogeneous population of 100 cells and you're choosing 15 nearest neighbors, then there could be a great deal of interchangeability.

If you plot the distribution of exact distances, do you see a bimodal distribution with a small number of neighbors with relatively small distances? Or do you see something less-than-bimodal, with a large number of neighbors with relatively small distances?

I expect the latter to be true. In my experience, low accuracy compared to exact nearest neighbors is not really a deal breaker, since pynndescent will still pick neighbors that are relatively close. (To begin with, there's nothing inherently biological about picking the exact 15 cells with the largest correlation values if many other cells have comparable correlations.)

Here's my recommendation: Try doing exact nearest neighbors to get a graph and do Louvain clustering to assign clusters. Do the same with pynndescent. Compare the cluster assignments using something like sklearn.metrics.adjusted_rand_score. What's the clustering similarity like? Even if the neighbors aren't the same, you might find that the cluster assignments are extremely similar.
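A minimal sketch of that comparison, assuming a dense expression matrix `X` (cells x genes); the Louvain step here uses the python-louvain and networkx packages, but any Louvain implementation would do:

```python
import networkx as nx
import numpy as np
from community import best_partition  # python-louvain
from pynndescent import NNDescent
from sklearn.metrics import adjusted_rand_score
from sklearn.neighbors import NearestNeighbors

def louvain_labels(knn_indices):
    # Build an undirected k-NN graph and cluster it with Louvain.
    n = knn_indices.shape[0]
    g = nx.Graph()
    g.add_nodes_from(range(n))
    for i, neighbors in enumerate(knn_indices):
        g.add_edges_from((i, int(j)) for j in neighbors if int(j) != i)
    partition = best_partition(g)
    return np.array([partition[i] for i in range(n)])

# Exact 15-nearest-neighbor indices.
exact_idx = NearestNeighbors(n_neighbors=15).fit(X).kneighbors(X)[1]
# Approximate 15-nearest-neighbor indices from pynndescent.
approx_idx = NNDescent(X, n_neighbors=15).neighbor_graph[0]

# adjusted_rand_score of 1.0 means identical cluster assignments.
print(adjusted_rand_score(louvain_labels(exact_idx),
                          louvain_labels(approx_idx)))
```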

jlmelville commented on August 27, 2024

I appreciate the comments and suggestions @atarashansky. I don't currently have the bandwidth to do any clustering on the dataset, but it's a good idea. I've not looked at the distribution of all the neighbor distances, but I have for the exact 15 nearest neighbor and 150 nearest neighbor distances. Here's a histogram of the 150 nearest neighbors:

[histogram: macosko2015 exact 150-nearest-neighbor distances]

In general, I agree that the approximate nearest neighbors should do just as good a job as the exact nearest neighbors. But the UMAP plots do come out different (see some of the plots at https://jlmelville.github.io/uwot/pycompare#macosko2015 and then further down at the "What about t-SNE?" section).

atarashansky commented on August 27, 2024

How does pynndescent compare to the exact distances if you use correlation distance as opposed to (what seems to be) euclidean distance? Similar error rate?
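One way to run that check, as a sketch: `X` stands for the same cells-by-genes matrix discussed above, and the brute-force exact step is O(n²) in both time and memory, so a subsample may be needed. pynndescent usually returns each point as its own first neighbor, so the sketch asks for k + 1 neighbors and drops the first column.

```python
import numpy as np
from pynndescent import NNDescent
from scipy.spatial.distance import cdist

k = 15
# Approximate neighbors under correlation distance.
nnd = NNDescent(X, metric="correlation", n_neighbors=k + 1)
approx_idx = nnd.neighbor_graph[0][:, 1:]  # drop self-neighbor column

# Exact correlation-distance neighbors by brute force (column 0 is self).
exact_idx = np.argsort(cdist(X, X, metric="correlation"), axis=1)[:, 1:k + 1]

# Mean fraction of exact neighbors recovered by the approximation.
recall = np.mean([len(set(a) & set(e)) / k
                  for a, e in zip(approx_idx, exact_idx)])
print(recall)
```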

jlmelville commented on August 27, 2024

Unfortunately, I did the majority of these calculations and data-processing in R, where I don't have access to nearest neighbor routines with correlation distance, so I'll have to get back to you at an unspecified time in the future on that one.

jlmelville commented on August 27, 2024

NN-Descent on High-Dimensional Data (further expanded in The Influence of Hubness on NN-Descent) makes the case that NN-Descent does badly on datasets with lots of "hubs": nodes that show up in the reverse neighbor lists of many other nodes. Hubs consistently appear in the candidate lists of other nodes, and non-hub nodes are less likely to be locally joined with decent candidates.

I calculated the "hubness" (the ratio of the reverse neighbor list size to the forward neighbor list size) on this dataset for the exact k = 15 neighbors, and it definitely suffers from hubness: the maximum is over 700 (i.e. one point shows up in the 15-nearest-neighbor lists of over 10,000 other nodes, nearly a quarter of the dataset). The second-most hubby (hubful?) dataset I've seen is CIFAR-10, which has a maximum hubness of a mere 140. Datasets like MNIST, FMNIST and KMNIST (which NND does well on) have maximum hubness in the 5-20 range.
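For reference, these figures can be computed directly from the exact k-NN index matrix; a short numpy sketch, where `knn_indices` is an (n, k) array of neighbor indices:

```python
import numpy as np

def hubness(knn_indices):
    # In-degree of each node in the k-NN graph (how many points list it
    # as a neighbor), divided by the forward list size k.
    n, k = knn_indices.shape
    in_degree = np.bincount(knn_indices.ravel(), minlength=n)
    return in_degree / k

# e.g. hubness(exact_idx).max() -> over 700 for macosko2015 at k = 15
```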

I consider this case closed except for actually fixing the problem. The linked papers above make some suggestions.

atarashansky commented on August 27, 2024

Good find! Incorporating in-degree scores into the NN-Descent algorithm does not seem like it should be too challenging. I'd be interested in submitting a PR tackling this issue. I'll keep you posted!

lmcinnes commented on August 27, 2024

Thanks for the references. I recall we had some concerns about hubness, but never looked into it much. I will have to look through the referenced paper to see if there are practicable solutions to the problem.

Also of note for making NN-Descent fail: lots of ties in distances. If there are many points equally close to a given point, it can be hard to "descend" the apparently flat slope of neighbor improvements. We have encountered a few datasets where this is the case, often with ties at 0 due to many duplicates, or ties at 1.0 for a bounded distance like cosine or Jaccard. Either case certainly hurts performance.

jlmelville commented on August 27, 2024

The two strategies that stood out are:

  • increase the number of neighbors internally and scale down the sample rate to try and equalize the total computation time.
  • calculate hubness of the current graph at each iteration and use that to replace hub nodes with random selections.

I have no reason to believe these don't work, but they are fairly brute-force approaches. The suggested implementation for estimating the hubness (track which nodes are added to and deleted from the graph) seems a bit fiddly to add to pynndescent (tracking deletions in particular seems like it would complicate things).
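A stateless approximation of the second strategy, recomputing candidate counts at each iteration instead of tracking additions and deletions, might look like the following sketch (all names here are hypothetical, not pynndescent internals):

```python
import numpy as np

def dampen_hubs(candidates, k, max_hubness=10.0, rng=None):
    # candidates: (n, max_candidates) array of candidate node ids.
    # Replace occurrences of over-represented (hub) nodes with
    # uniformly random node ids.
    if rng is None:
        rng = np.random.default_rng()
    n = candidates.shape[0]
    counts = np.bincount(candidates.ravel(), minlength=n)
    hubs = np.flatnonzero(counts / k > max_hubness)
    mask = np.isin(candidates, hubs)
    candidates[mask] = rng.integers(0, n, size=int(mask.sum()))
    return candidates
```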

I did some preliminary calculations on various datasets and you can spot hub nodes fairly well after the first iteration as long as max_candidates is sufficiently high (20 worked; 5 seems too low). I haven’t got any further than that yet though.

lmcinnes commented on August 27, 2024

That sounds good -- if we can spot hub nodes (and max_candidates should be higher than 5 almost always) then some mitigation might be easily tractable. I look forward to anything further you find on this.
