Comments (10)

linqiaozhi commented on August 26, 2024

Good question--there are two uses for this code:

  1. I found that people often wanted to obtain the input similarity matrix that t-SNE generates and load it into, say, Matlab. Then they would use it for some other visualization or processing task.

  2. I had originally implemented the following feature: if the val_P/row_P/col_P files existed, then load them and skip the computation of the input similarities. I found this very useful when my input data (and perplexity) were fixed and I was experimenting with other parameters and initializations. It is common for people to run t-SNE several times with different initializations, and since the input similarity matrix is exactly the same for each of these runs, it made sense not to recompute it every time. I ended up commenting it out because I was afraid people would forget they were using the old input similarities. If implemented correctly (probably with a parameter passed in, as opposed to just checking for the files), I think this would be a useful feature.

dkobak commented on August 26, 2024

I see. I agree it can be a useful feature, but perhaps it should be implemented in TSNE::run() and not downstream. If the save flag is on, then row_P, col_P, val_P (or P) are saved to a file after they are computed. If the load flag is on, they are read from that file instead of being recomputed. By default, they are neither loaded nor saved. Something like that.

Maybe it can be implemented as one input flag: the default 0 means do nothing, 1 means save, 2 means load. And the Matlab/R/Python wrappers can use more informative names for this flag, e.g. None/'save'/'load'.
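For concreteness, here is a minimal sketch of what that could look like, assuming the CSR-style row_P/col_P/val_P arrays the code already uses. The helper names saveP/loadP and the file name P.dat are made up for this sketch, not part of the actual FIt-SNE API:

```cpp
// Hypothetical sketch: cache the sparse input similarities to disk.
// saveP/loadP and P.dat are illustrative names, not the real interface.
#include <cstdio>
#include <vector>

static const char* P_FILE = "P.dat";  // hypothetical cache file

// Write N, the number of nonzeros, and the three CSR arrays.
bool saveP(int N, const std::vector<unsigned int>& row_P,
           const std::vector<unsigned int>& col_P,
           const std::vector<double>& val_P) {
    FILE* f = std::fopen(P_FILE, "wb");
    if (!f) return false;
    unsigned long long nnz = val_P.size();
    std::fwrite(&N, sizeof(N), 1, f);
    std::fwrite(&nnz, sizeof(nnz), 1, f);
    std::fwrite(row_P.data(), sizeof(unsigned int), row_P.size(), f);
    std::fwrite(col_P.data(), sizeof(unsigned int), nnz, f);
    std::fwrite(val_P.data(), sizeof(double), nnz, f);
    std::fclose(f);
    return true;
}

// Read them back; the caller should verify that N matches its data.
bool loadP(int& N, std::vector<unsigned int>& row_P,
           std::vector<unsigned int>& col_P, std::vector<double>& val_P) {
    FILE* f = std::fopen(P_FILE, "rb");
    if (!f) return false;
    unsigned long long nnz = 0;
    bool ok = std::fread(&N, sizeof(N), 1, f) == 1 &&
              std::fread(&nnz, sizeof(nnz), 1, f) == 1;
    if (ok) {
        row_P.resize((size_t)N + 1);  // CSR row pointers
        col_P.resize(nnz);
        val_P.resize(nnz);
        ok = std::fread(row_P.data(), sizeof(unsigned int), row_P.size(), f) == row_P.size() &&
             std::fread(col_P.data(), sizeof(unsigned int), nnz, f) == nnz &&
             std::fread(val_P.data(), sizeof(double), nnz, f) == nnz;
    }
    std::fclose(f);
    return ok;
}
```

Inside TSNE::run() the dispatch would then be: if the flag is 2 and loadP() succeeds, skip computeGaussianPerplexity(); otherwise compute as usual and call saveP() when the flag is 1.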

linqiaozhi commented on August 26, 2024

Agreed--I think this would be a great way to implement it.

dkobak commented on August 26, 2024

I might do it but most likely not before the end of the month. If you don't mind, leave this open for now.

By the way, the two versions of computeGaussianPerplexity() -- the one using Annoy and the one using VP trees -- are quite similar (the same memory allocation, the same multithreading, the same progress messages, etc.), and if I make any extension, e.g. adding K_random random points to the K nearest neighbours, I have to do it in both places. What do you think of combining these two functions into one, with knn_algo as an input parameter? I think this could simplify the code.
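To make the idea concrete, a rough sketch of what the unified signature could look like; KnnAlgo, knnSearch, and the placeholder CSR filling are all assumptions of this sketch (a brute-force search stands in for both backends so that it compiles):

```cpp
// Sketch of the proposed refactor: one computeGaussianPerplexity() with
// the KNN backend as an input parameter, so that the shared allocation,
// multithreading and progress reporting live in a single place.
#include <algorithm>
#include <utility>
#include <vector>

enum class KnnAlgo { ANNOY, VP_TREE };

// Stand-in for either backend: for each point, the indices and squared
// distances of its K nearest neighbours (excluding the point itself).
static void knnSearch(const double* X, int N, int D, int K,
                      std::vector<int>& idx, std::vector<double>& dist) {
    idx.assign((size_t)N * K, -1);
    dist.assign((size_t)N * K, 0.0);
    std::vector<std::pair<double, int>> cand(N);
    for (int i = 0; i < N; ++i) {
        for (int j = 0; j < N; ++j) {
            double d2 = 0.0;
            for (int d = 0; d < D; ++d) {
                double diff = X[(size_t)i * D + d] - X[(size_t)j * D + d];
                d2 += diff * diff;
            }
            cand[j] = {d2, j};
        }
        std::sort(cand.begin(), cand.end());
        int m = 0;
        for (int j = 0; j < N && m < K; ++j) {
            if (cand[j].second == i) continue;  // skip the query point
            idx[(size_t)i * K + m] = cand[j].second;
            dist[(size_t)i * K + m] = cand[j].first;
            ++m;
        }
    }
}

void computeGaussianPerplexity(const double* X, int N, int D, int K,
                               double perplexity, KnnAlgo knn_algo,
                               std::vector<unsigned int>& row_P,
                               std::vector<unsigned int>& col_P,
                               std::vector<double>& val_P) {
    std::vector<int> idx;
    std::vector<double> dist;
    // The only part that differs between the two existing versions:
    if (knn_algo == KnnAlgo::ANNOY)
        knnSearch(X, N, D, K, idx, dist);  // real code: Annoy here
    else
        knnSearch(X, N, D, K, idx, dist);  // real code: VP tree here
    // Shared part: build the sparse CSR matrix. val_P gets a placeholder;
    // the real code binary-searches each sigma_i so that every row of
    // Gaussian similarities matches the target perplexity.
    row_P.resize((size_t)N + 1);
    col_P.resize((size_t)N * K);
    val_P.resize((size_t)N * K);
    for (int i = 0; i <= N; ++i) row_P[i] = (unsigned int)(i * K);
    for (size_t m = 0; m < col_P.size(); ++m) {
        col_P[m] = (unsigned int)idx[m];
        val_P[m] = 1.0 / K;  // placeholder value
    }
    (void)perplexity;  // consumed by the real calibration step
    (void)dist;        // likewise
}
```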

dkobak commented on August 26, 2024

Hi George. I could submit pull requests for the stuff I've been working on over the next week. Just wanted to double-check what you think about combining computeGaussianPerplexity() for Annoy/VP trees into one function (as per my comment above).

I could do the following PRs:

  1. Load/save Ps into file.
  2. Combine computeGaussianPerplexity() for Annoy/VP to simplify the code.
  3. Perplexity combination.
  4. Supply K_random in addition to K (see the sketch after this comment).

Let me know if you think any of that is not needed here. I am not quite sure how useful (4) will turn out to be in the end, but I found it interesting to experiment with.
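Purely for illustration, a rough guess at what (4) could mean in code; the helper name addRandomNeighbors and the row-major N x K index layout are assumptions, and (as discussed below) this feature was never merged:

```cpp
// Hypothetical sketch of "K_random in addition to K": after the K
// nearest neighbours of each point are found, append K_random uniformly
// random points per row before computing the Gaussian similarities.
#include <random>
#include <vector>

void addRandomNeighbors(int N, int K, int K_random,
                        std::vector<int>& idx /* N x K, row-major */) {
    std::mt19937 rng(42);  // fixed seed for reproducibility
    std::uniform_int_distribution<int> pick(0, N - 1);
    std::vector<int> out;
    out.reserve((size_t)N * (K + K_random));
    for (int i = 0; i < N; ++i) {
        // keep the K true nearest neighbours of point i
        out.insert(out.end(), idx.begin() + (size_t)i * K,
                   idx.begin() + (size_t)(i + 1) * K);
        // then add K_random random points (collisions with the true
        // neighbours are ignored here for simplicity)
        for (int r = 0; r < K_random; ++r) {
            int j = pick(rng);
            if (j == i) j = (j + 1) % N;  // avoid the point itself
            out.push_back(j);
        }
    }
    idx.swap(out);  // idx is now N x (K + K_random)
}
```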

linqiaozhi commented on August 26, 2024

Hey @dkobak, so sorry for the late response! I thought I replied, but clearly I did not.

Refactoring the redundant code in computeGaussianPerplexity() into one function would be great. I think that 1 and 2 are very useful.

As for 3, I think you had sent me a citation showing that this was useful. Did you find that it is useful in large datasets as well? Also, is it computationally feasible on large datasets? Large perplexities mean lots of nearest neighbors, and I know that looping through a large number of neighbors at each iteration is the speed bottleneck when you have lots of points... so I just want to make sure it's useful on large datasets.

As for 4, I think it's really interesting. Do you have a citation for how this can improve the embedding? I'd like to only include features that have been shown to be useful, so that it's not confusing to people.

Thanks so much, and again, my apologies for the tardy response!

dkobak commented on August 26, 2024

Re 3: yes, there is a paper advocating this procedure. You are of course right about large datasets: it's not really possible to use large perplexities for large datasets. However, for n \approx 25k datasets I found it very useful. In the paper I am preparing, we are actually going to recommend this perplexity combination as one of the useful "tricks". For one particular n=25k dataset that I am using as my main example, I obtain the best results using a combination of perplexities [10, 100, 500]. So it's up to you, but if you don't want to merge this feature into your code, I will have to refer readers to my branch :)
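For readers following along: the trick amounts to averaging the input similarity matrices computed at several perplexities. A self-contained toy sketch, with a dense P and a fixed bandwidth standing in for the real per-point calibration (and no symmetrization); only the averaging loop in main() is the point here:

```cpp
// Toy sketch of the perplexity-combination trick: compute the input
// similarities once per perplexity and average the resulting matrices.
// gaussianAffinities is a simplified stand-in for the real
// computeGaussianPerplexity().
#include <cmath>
#include <cstdio>
#include <vector>

static void gaussianAffinities(const std::vector<double>& X, int N, int D,
                               double perplexity, std::vector<double>& P) {
    P.assign((size_t)N * N, 0.0);
    double sigma2 = perplexity;  // crude placeholder for the calibrated sigma_i^2
    for (int i = 0; i < N; ++i) {
        double row_sum = 0.0;
        for (int j = 0; j < N; ++j) {
            if (j == i) continue;
            double d2 = 0.0;
            for (int d = 0; d < D; ++d) {
                double diff = X[(size_t)i * D + d] - X[(size_t)j * D + d];
                d2 += diff * diff;
            }
            P[(size_t)i * N + j] = std::exp(-d2 / (2.0 * sigma2));
            row_sum += P[(size_t)i * N + j];
        }
        for (int j = 0; j < N; ++j) P[(size_t)i * N + j] /= row_sum;
    }
}

int main() {
    const int N = 4, D = 2;
    std::vector<double> X = {0, 0, 1, 0, 0, 1, 5, 5};
    std::vector<double> perplexities = {2.0, 3.0};  // e.g. [10, 100, 500] on real data
    std::vector<double> P_combined((size_t)N * N, 0.0), P;
    for (double u : perplexities) {
        gaussianAffinities(X, N, D, u, P);
        // the combination step: a plain average over perplexities
        for (size_t k = 0; k < P_combined.size(); ++k)
            P_combined[k] += P[k] / perplexities.size();
    }
    std::printf("P_combined[0][1] = %f\n", P_combined[1]);
    return 0;
}
```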

As for 4, no, I don't have any citations. I did find it marginally useful in some situations, but, to be honest, not useful enough. I will probably not describe or recommend it for now. I agree that it's not worth merging into the master.

> I'd like to only include features that have been shown to be useful, so that it's not confusing to people.

Sure. Even though you do have some "experimental" stuff in the code, e.g. setting a fixed sigma instead of a perplexity. :)

linqiaozhi commented on August 26, 2024

Great, whenever you get the chance, please go ahead and also PR the perplexity combination (3). I look forward to trying it out--and reading the paper you are preparing!

Thanks so much for all your help!!

dkobak commented on August 26, 2024

I will be away for the next two weeks, and by now it's clear that I won't be able to do this before I leave. So I will try to do these PRs some time in the middle of August. In case you are also on vacation then, it'll wait until September. Cheers.

dkobak commented on August 26, 2024

I am closing this issue now that #25 has been merged.
