cwehmeyer / pydpc

62.0 62.0 20.0 58 KB

Clustering by fast search and find of density peaks

License: GNU Lesser General Public License v3.0

C 5.94% Python 84.81% Jupyter Notebook 7.93% C++ 1.31%

pydpc's Issues

Segmentation fault 11 in assign() method

With some parameters, the method assign() results in a Segmentation fault 11 error.
The error occurs in this line:
    self.halo_idx, self.core_idx = _core.get_halo(
        self.density, self.membership,
        self.border_density, self.border_member.astype(_np.intc), border_only=border_only)
After I comment out the line, it works fine.
My question is: can I comment out the line without distorting the clustering result?
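
Judging only from the call quoted above, self.membership is passed into get_halo() as an input, so the cluster assignment appears to be computed before this line; what would be missing after commenting it out is halo_idx and core_idx. To make the halo/core split concrete, here is a minimal pure-Python sketch that follows the description in the density-peaks paper (border region, highest border density per cluster, core above it, halo at or below it). It is not pydpc's _core.get_halo(): the function name split_halo_core and the parameter d_c are hypothetical, and the border_only option from the quoted call is not modelled.

```python
def split_halo_core(density, membership, distances, d_c):
    """Split points into halo and core, following the paper's description.

    Hypothetical helper, NOT pydpc's _core.get_halo().
    density:    list of per-point densities
    membership: list of cluster labels 0..k-1 (read only, never modified)
    distances:  full pairwise distance matrix as a list of lists
    d_c:        cutoff distance that defines the border region (assumed name)
    """
    n = len(density)
    n_clusters = max(membership) + 1
    # Highest density found in each cluster's border region (0.0 if no border).
    border_density = [0.0] * n_clusters
    for i in range(n):
        for j in range(n):
            other_cluster = membership[i] != membership[j]
            if other_cluster and distances[i][j] < d_c:
                c = membership[i]
                border_density[c] = max(border_density[c], density[i])
    # Core: strictly denser than the cluster's border density; halo: the rest.
    core = [i for i in range(n) if density[i] > border_density[membership[i]]]
    halo = [i for i in range(n) if density[i] <= border_density[membership[i]]]
    return halo, core


# Four points on a line, two clusters that touch between points 1 and 2.
xs = [0.0, 1.0, 2.0, 3.0]
dist = [[abs(a - b) for b in xs] for a in xs]
halo, core = split_halo_core([3.0, 1.0, 1.5, 4.0], [0, 0, 1, 1], dist, d_c=1.5)
print("halo:", halo, "core:", core)
```

In this sketch the membership list is never written to, which matches the reading that skipping the halo step leaves the cluster labels as they were and only drops the halo/core information.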

Package shuts down on large data.

Hi Dev Team, thanks for the package, it's one of my favs since y'all's implementation is straightforward and even includes some improvements on the paper.

My issue is that I recently used the package to cluster a large corpus of text (the tf-idf of the corpus). The RAM on my compute instance is 1.5 TB, so there is definitely enough memory.

But what tends to happen is, if I pass an array/dataframe with more than 50k observations, the program shuts down (usually at the "self.kernel_size = _core.get_kernel_size(self.distances, self.fraction" line). The data sent in is all numeric, and the distances are calculated using sklearn.metrics.pairwise.cosine_similarity.

When it shuts down it gives no error message except:
"The python program has shutdown"
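
Two things may be worth checking here, sketched below with the standard library only. First, a dense n x n matrix for 50k points has 2.5 billion entries, which is more than a signed 32-bit C int can index; whether pydpc's C core actually uses 32-bit indices is an assumption, not something stated in this thread, but the limit sits at 46,340 points, which is close to the reported threshold. Second, cosine_similarity returns similarities (1 for identical directions), not distances, so it would need converting (scikit-learn also provides cosine_distances). The helper names dense_matrix_report and cosine_similarity_to_distance are hypothetical.

```python
INT32_MAX = 2**31 - 1  # largest value a signed 32-bit C int can hold


def dense_matrix_report(n_points, bytes_per_value=8):
    """Size of a dense n x n matrix of 8-byte floats (hypothetical helper)."""
    n_entries = n_points * n_points
    return {
        "entries": n_entries,
        "gib": n_entries * bytes_per_value / 2**30,
        # False means a flat index i * n + j can overflow a 32-bit int.
        "fits_int32": n_entries <= INT32_MAX,
    }


def cosine_similarity_to_distance(similarity):
    """Turn a cosine similarity into a distance, clipping rounding noise."""
    return max(0.0, 1.0 - similarity)


for n in (10_000, 46_340, 46_341, 50_000):
    print(n, dense_matrix_report(n))
```

If the entry count is the problem, the crash would depend on the number of observations rather than on available RAM, which would be consistent with a silent shutdown on a 1.5 TB machine; this remains a guess until checked against the C source.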

Is there any document to understand and use this package?

Thanks for your contribution, but I can't find any documentation or tutorial for this package. Is there any way to get one?

`kernel_size` can be zero, giving nonsense results, when enough distances are zero

Hi,

First off, thanks for making this optimized DPC implementation available -- it's a big help.

When enough off-diagonal zeros (on the order of fraction * n) are present in the pairwise distance matrix, the kernel size can be set to zero, leading to division by zero and meaningless zeros and NaNs in the density.

This could be solved relatively simply by adding a fallback case to _get_kernel_size():

// ...
if(kernel_size == 0.0) {
    for(int i = 0; i < n; i++) {
        if(scratch[i] != 0.0) {
            kernel_size = scratch[i];
            break;
        }
    }
}

Taking the first non-zero distance as the width of the Gaussian seems like a reasonable fallback assumption. In any case, if pydpc doesn't want to presume that this is a reasonable fallback, either the user should be given the ability to specify kernel_width or an error should be raised.

(I don't think this scenario is too pathological, either: I came across it because I had duplicate points under the Euclidean metric, which may or may not make sense depending on the problem, but a user could also pass in a distance matrix generated by some function that, for example, marks as identical inputs within some cutoff, etc., etc.)

Thanks,
A.M.
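
The fallback proposed above can be sketched in pure Python as follows. This is an illustration under assumptions, not pydpc's _get_kernel_size(): it assumes the kernel size is read from the sorted off-diagonal distances at position fraction * count (the list plays the role of scratch in the C snippet), and the function name kernel_size_with_fallback is hypothetical. It combines both suggestions: fall back to the first non-zero distance, and raise an error when no such distance exists.

```python
def kernel_size_with_fallback(distances, fraction):
    """Pick a kernel size from sorted distances, never returning zero.

    Hypothetical sketch; the quantile rule below is an assumption about
    how the kernel size is chosen, not pydpc's actual implementation.
    """
    n = len(distances)
    # Off-diagonal upper triangle, sorted ascending (the `scratch` array).
    scratch = sorted(distances[i][j] for i in range(n) for j in range(i + 1, n))
    if not scratch:
        raise ValueError("need at least two points")
    index = min(int(fraction * len(scratch)), len(scratch) - 1)
    kernel_size = scratch[index]
    if kernel_size == 0.0:
        # Fallback from the issue: the first non-zero distance, if any.
        kernel_size = next((d for d in scratch if d != 0.0), 0.0)
    if kernel_size == 0.0:
        raise ValueError("all pairwise distances are zero")
    return kernel_size


# Three duplicate points make the low quantiles of the distances zero.
xs = [0.0, 0.0, 0.0, 1.0, 5.0]
dist = [[abs(a - b) for b in xs] for a in xs]
print(kernel_size_with_fallback(dist, 0.02))
```

Raising on an all-zero matrix keeps the failure visible instead of letting NaNs propagate into the density.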
