Comments (10)

linqiaozhi commented on August 26, 2024

Good question--there are two uses for this code:

  1. I found that people often wanted to obtain the input similarity matrix that t-SNE generates and load it into, say, Matlab. Then they would use it for some other visualization or processing task.

  2. I had originally implemented the following feature: if the val_P/row_P/col_P files existed, then load them and skip the computation of the input similarities. I found this very useful when my input data (and perplexity) were fixed and I was experimenting with other parameters and initializations. It is common for people to run t-SNE several times with different initializations, and since the input similarity matrix is exactly the same for each of these runs, it made sense not to recompute it every time. I ended up commenting it out because I was afraid people would forget they were using the old input similarities. If implemented correctly (probably with a parameter passed in, as opposed to just checking for the files), I think this would be a useful feature.

dkobak commented on August 26, 2024

I see. I agree it can be a useful feature, but perhaps it should be implemented in TSNE::run() and not downstream. If the save flag is on, then row_P, col_P, val_P (or P) are saved to a file after they are computed. If the load flag is on, they are read from that file instead of being recomputed. By default, they are neither loaded nor saved. Something like that.

Maybe it can be implemented as one input flag: the default 0 means do nothing, 1 means save, 2 means load. And the Matlab/R/Python wrappers can use more informative names for this flag, e.g. None/'save'/'load'.
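For concreteness, here is a minimal sketch of what that could look like, assuming the CSR-style row_P/col_P/val_P arrays the code already uses. The helper names saveP/loadP and the file name P.dat are made up for this sketch, not part of the actual FIt-SNE API:

```cpp
// Hypothetical sketch: cache the sparse input similarities to disk.
// saveP/loadP and P.dat are illustrative names, not the real interface.
#include <cstdio>
#include <vector>

static const char* P_FILE = "P.dat";  // hypothetical cache file

// Write N, the number of nonzeros, and the three CSR arrays.
bool saveP(int N, const std::vector<unsigned int>& row_P,
           const std::vector<unsigned int>& col_P,
           const std::vector<double>& val_P) {
    FILE* f = std::fopen(P_FILE, "wb");
    if (!f) return false;
    unsigned long long nnz = val_P.size();
    std::fwrite(&N, sizeof(N), 1, f);
    std::fwrite(&nnz, sizeof(nnz), 1, f);
    std::fwrite(row_P.data(), sizeof(unsigned int), row_P.size(), f);
    std::fwrite(col_P.data(), sizeof(unsigned int), nnz, f);
    std::fwrite(val_P.data(), sizeof(double), nnz, f);
    std::fclose(f);
    return true;
}

// Read them back; the caller should verify that N matches its data.
bool loadP(int& N, std::vector<unsigned int>& row_P,
           std::vector<unsigned int>& col_P, std::vector<double>& val_P) {
    FILE* f = std::fopen(P_FILE, "rb");
    if (!f) return false;
    unsigned long long nnz = 0;
    bool ok = std::fread(&N, sizeof(N), 1, f) == 1 &&
              std::fread(&nnz, sizeof(nnz), 1, f) == 1;
    if (ok) {
        row_P.resize((size_t)N + 1);  // CSR row pointers
        col_P.resize(nnz);
        val_P.resize(nnz);
        ok = std::fread(row_P.data(), sizeof(unsigned int), row_P.size(), f) == row_P.size() &&
             std::fread(col_P.data(), sizeof(unsigned int), nnz, f) == nnz &&
             std::fread(val_P.data(), sizeof(double), nnz, f) == nnz;
    }
    std::fclose(f);
    return ok;
}
```

Inside TSNE::run() the dispatch would then be: if the flag is 2 and loadP() succeeds, skip computeGaussianPerplexity(); otherwise compute as usual and call saveP() when the flag is 1.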

linqiaozhi commented on August 26, 2024

Agreed--I think this would be a great way to implement it.

dkobak commented on August 26, 2024

I might do it but most likely not before the end of the month. If you don't mind, leave this open for now.

By the way, the two versions of computeGaussianPerplexity() -- the one using Annoy and the one using VP trees -- are quite similar (the same memory allocation, the same multithreading, the same progress messages, etc.), and if I make any extension, e.g. adding K_random random points to the K nearest neighbours, I have to do it in both places. What do you think of combining these two functions into one, with knn_algo as an input parameter? I think this could simplify the code.
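To make the idea concrete, a rough sketch of what the unified signature could look like; KnnAlgo, knnSearch, and the placeholder CSR filling are all assumptions of this sketch (a brute-force search stands in for both backends so that it compiles):

```cpp
// Sketch of the proposed refactor: one computeGaussianPerplexity() with
// the KNN backend as an input parameter, so that the shared allocation,
// multithreading and progress reporting live in a single place.
#include <algorithm>
#include <utility>
#include <vector>

enum class KnnAlgo { ANNOY, VP_TREE };

// Stand-in for either backend: for each point, the indices and squared
// distances of its K nearest neighbours (excluding the point itself).
static void knnSearch(const double* X, int N, int D, int K,
                      std::vector<int>& idx, std::vector<double>& dist) {
    idx.assign((size_t)N * K, -1);
    dist.assign((size_t)N * K, 0.0);
    std::vector<std::pair<double, int>> cand(N);
    for (int i = 0; i < N; ++i) {
        for (int j = 0; j < N; ++j) {
            double d2 = 0.0;
            for (int d = 0; d < D; ++d) {
                double diff = X[(size_t)i * D + d] - X[(size_t)j * D + d];
                d2 += diff * diff;
            }
            cand[j] = {d2, j};
        }
        std::sort(cand.begin(), cand.end());
        int m = 0;
        for (int j = 0; j < N && m < K; ++j) {
            if (cand[j].second == i) continue;  // skip the query point
            idx[(size_t)i * K + m] = cand[j].second;
            dist[(size_t)i * K + m] = cand[j].first;
            ++m;
        }
    }
}

void computeGaussianPerplexity(const double* X, int N, int D, int K,
                               double perplexity, KnnAlgo knn_algo,
                               std::vector<unsigned int>& row_P,
                               std::vector<unsigned int>& col_P,
                               std::vector<double>& val_P) {
    std::vector<int> idx;
    std::vector<double> dist;
    // The only part that differs between the two existing versions:
    if (knn_algo == KnnAlgo::ANNOY)
        knnSearch(X, N, D, K, idx, dist);  // real code: Annoy here
    else
        knnSearch(X, N, D, K, idx, dist);  // real code: VP tree here
    // Shared part: build the sparse CSR matrix. val_P gets a placeholder;
    // the real code binary-searches each sigma_i so that every row of
    // Gaussian similarities matches the target perplexity.
    row_P.resize((size_t)N + 1);
    col_P.resize((size_t)N * K);
    val_P.resize((size_t)N * K);
    for (int i = 0; i <= N; ++i) row_P[i] = (unsigned int)(i * K);
    for (size_t m = 0; m < col_P.size(); ++m) {
        col_P[m] = (unsigned int)idx[m];
        val_P[m] = 1.0 / K;  // placeholder value
    }
    (void)perplexity;  // consumed by the real calibration step
    (void)dist;        // likewise
}
```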

dkobak commented on August 26, 2024

Hi George. I could submit pull requests for the stuff I've been working on over the next week. Just wanted to double-check what you think about combining computeGaussianPerplexity() for Annoy/VP trees into one function (as per my comment above).

I could do the following PRs:

  1. Load/save Ps into file.
  2. Combine computeGaussianPerplexity() for Annoy/VP to simplify the code.
  3. Perplexity combination.
  4. Supply K_random in addition to K (see the sketch after this comment).

Let me know if you think any of that is not needed here. I am not quite sure how useful (4) will turn out to be in the end, but I found it interesting to experiment with.
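Purely for illustration, a rough guess at what (4) could mean in code; the helper name addRandomNeighbors and the row-major N x K index layout are assumptions, and (as discussed below) this feature was never merged:

```cpp
// Hypothetical sketch of "K_random in addition to K": after the K
// nearest neighbours of each point are found, append K_random uniformly
// random points per row before computing the Gaussian similarities.
#include <random>
#include <vector>

void addRandomNeighbors(int N, int K, int K_random,
                        std::vector<int>& idx /* N x K, row-major */) {
    std::mt19937 rng(42);  // fixed seed for reproducibility
    std::uniform_int_distribution<int> pick(0, N - 1);
    std::vector<int> out;
    out.reserve((size_t)N * (K + K_random));
    for (int i = 0; i < N; ++i) {
        // keep the K true nearest neighbours of point i
        out.insert(out.end(), idx.begin() + (size_t)i * K,
                   idx.begin() + (size_t)(i + 1) * K);
        // then add K_random random points (collisions with the true
        // neighbours are ignored here for simplicity)
        for (int r = 0; r < K_random; ++r) {
            int j = pick(rng);
            if (j == i) j = (j + 1) % N;  // avoid the point itself
            out.push_back(j);
        }
    }
    idx.swap(out);  // idx is now N x (K + K_random)
}
```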

linqiaozhi commented on August 26, 2024

Hey @dkobak, so sorry for the late response! I thought I replied, but clearly I did not.

Refactoring the redundant code in computeGaussianPerplexity() into one function would be great. I think that 1 and 2 are very useful.

As for 3, I think you had sent me a citation showing that this was useful. Did you find that it is useful in large datasets as well? Also, is it computationally feasible on large datasets? Large perplexities mean lots of nearest neighbors, and I know that looping through a large number of neighbors at each iteration is the speed bottleneck when you have lots of points... so I just want to make sure it's useful on large datasets.

As for 4, I think it's really interesting. Do you have a citation for how this can improve the embedding? I'd like to only include features that have been shown to be useful, so that it's not confusing to people.

Thanks so much, and again, my apologies for the tardy response!

dkobak commented on August 26, 2024

Re 3: yes, there is a paper advocating this procedure. You are of course right about large datasets: it's not really possible to use large perplexities for large datasets. However, for n \approx 25k datasets I found it very useful. In the paper I am preparing, we are actually going to recommend this perplexity combination as one of the useful "tricks". For one particular n=25k dataset that I am using as my main example, I obtain the best results using a combination of perplexities [10, 100, 500]. So it's up to you, but if you don't want to merge this feature into your code, I will have to refer readers to my branch :)
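For readers following along: the trick amounts to averaging the input similarity matrices computed at several perplexities. A self-contained toy sketch, with a dense P and a fixed bandwidth standing in for the real per-point calibration (and no symmetrization); only the averaging loop in main() is the point here:

```cpp
// Toy sketch of the perplexity-combination trick: compute the input
// similarities once per perplexity and average the resulting matrices.
// gaussianAffinities is a simplified stand-in for the real
// computeGaussianPerplexity().
#include <cmath>
#include <cstdio>
#include <vector>

static void gaussianAffinities(const std::vector<double>& X, int N, int D,
                               double perplexity, std::vector<double>& P) {
    P.assign((size_t)N * N, 0.0);
    double sigma2 = perplexity;  // crude placeholder for the calibrated sigma_i^2
    for (int i = 0; i < N; ++i) {
        double row_sum = 0.0;
        for (int j = 0; j < N; ++j) {
            if (j == i) continue;
            double d2 = 0.0;
            for (int d = 0; d < D; ++d) {
                double diff = X[(size_t)i * D + d] - X[(size_t)j * D + d];
                d2 += diff * diff;
            }
            P[(size_t)i * N + j] = std::exp(-d2 / (2.0 * sigma2));
            row_sum += P[(size_t)i * N + j];
        }
        for (int j = 0; j < N; ++j) P[(size_t)i * N + j] /= row_sum;
    }
}

int main() {
    const int N = 4, D = 2;
    std::vector<double> X = {0, 0, 1, 0, 0, 1, 5, 5};
    std::vector<double> perplexities = {2.0, 3.0};  // e.g. [10, 100, 500] on real data
    std::vector<double> P_combined((size_t)N * N, 0.0), P;
    for (double u : perplexities) {
        gaussianAffinities(X, N, D, u, P);
        // the combination step: a plain average over perplexities
        for (size_t k = 0; k < P_combined.size(); ++k)
            P_combined[k] += P[k] / perplexities.size();
    }
    std::printf("P_combined[0][1] = %f\n", P_combined[1]);
    return 0;
}
```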

As for 4, no, I don't have any citations. I did find it marginally useful in some situations, but, to be honest, not useful enough. I will probably not describe or recommend it for now. I agree that it's not worth merging into the master.

> I'd like to only include features that have been shown to be useful, so that it's not confusing to people.

Sure. Even though you do have some "experimental" stuff in the code, e.g. setting a fixed sigma instead of a perplexity. :)

linqiaozhi commented on August 26, 2024

Great, whenever you get the chance, please go ahead and also PR the perplexity combination (3). I look forward to trying it out--and reading the paper you are preparing!

Thanks so much for all your help!!

dkobak commented on August 26, 2024

I will be away for the next two weeks, and by now it's clear that I won't be able to do this before I leave. So I will try to do these PRs some time in the middle of August. In case you are also on vacation then, it'll wait until September. Cheers.

dkobak commented on August 26, 2024

I am closing this issue now that #25 has been merged.
