pydpc's People

Contributors

cwehmeyer, linux-cpp-lisp

pydpc's Issues

Segmentation fault 11 in assign() method

With some parameters, the assign() method crashes with a "Segmentation fault 11" error.
The error occurs in this line:
    self.halo_idx, self.core_idx = _core.get_halo(
        self.density, self.membership,
        self.border_density, self.border_member.astype(_np.intc), border_only=border_only)
After I comment out the line, it works fine.
My question is: can I comment out the line without distorting the clustering result?

Package shuts down on large data.

Hi Dev Team, thanks for the package. It's one of my favs, since your implementation is straightforward and even includes some improvements from the paper.

My issue is that I recently used the package to cluster a large corpus of text (the tf-idf of the corpus). The RAM on my compute instance is 1.5 TB, so there is definitely enough memory.

But what tends to happen is that if I pass an array/dataframe with more than 50k observations, the program shuts down (usually at the "self.kernel_size = _core.get_kernel_size(self.distances, self.fraction" line). The data sent in is all numeric, and the distances are calculated using sklearn.metrics.pairwise.cosine_similarity.

When it shuts down it gives no error message except:
"The python program has shutdown"
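For scale, here is a rough back-of-the-envelope sketch of what a dense n x n matrix of 8-byte floats involves at this size. The helper name `dense_matrix_stats` is hypothetical, and the 32-bit check only matters if the C code indexes the flattened matrix with a signed 32-bit integer, which is an assumption and not something this thread confirms.

```python
INT32_MAX = 2**31 - 1  # largest value of a signed 32-bit integer


def dense_matrix_stats(n, bytes_per_value=8):
    """Return (element count, size in GiB, fits in int32) for an n x n matrix.

    Hypothetical helper for illustration; it does not call pydpc.
    """
    elements = n * n
    gib = elements * bytes_per_value / 2**30
    return elements, gib, elements <= INT32_MAX


for n in (40_000, 46_340, 46_341, 50_000):
    elements, gib, fits = dense_matrix_stats(n)
    print(f"n={n}: {elements} elements, {gib:.1f} GiB, fits in int32: {fits}")
```

With 50k observations the matrix has 2.5 billion entries, which is above the signed 32-bit limit; the crossover is between n = 46,340 and n = 46,341. If that assumption about the indexing holds, the crash would be unrelated to available RAM.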

`kernel_size` can be zero, giving nonsense results, when enough distances are zero

Hi,

First off, thanks for making this optimized DPC implementation available -- it's a big help.

When a sufficient number (fraction * n) of off-diagonal zeros is present in the pairwise distance matrix, the kernel size can be set to zero, leading to division by zero and nonsense zeros and NaNs in the density.

This could be solved relatively simply by adding a fallback case to _get_kernel_size():

// ...
if(kernel_size == 0.0) {
    for(int i = 0; i < n; i++) {
        if(scratch[i] != 0.0) {
            kernel_size = scratch[i];
            break;
        }
    }
}

Taking the first non-zero distance as the width of the Gaussian seems like a reasonable fallback assumption. In any case, if pydpc doesn't want to presume that this is a reasonable fallback, either the user should be given the ability to specify kernel_width or an error should be thrown.
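To make the proposal concrete, here is a minimal pure-Python sketch of the same fallback, combined with the "throw an error" option for the case where every distance is zero. The function name `kernel_size_with_fallback` and the selection rule (the sorted off-diagonal distance at the requested fraction) are assumptions for illustration; they are not taken from pydpc's actual C code.

```python
def kernel_size_with_fallback(distances, fraction):
    """Pick a kernel size from a square distance matrix (list of lists).

    Hypothetical helper: the selection rule below is an assumed stand-in
    for whatever _get_kernel_size() really does.
    """
    n = len(distances)
    # Collect each off-diagonal distance once (upper triangle) and sort.
    scratch = sorted(distances[i][j] for i in range(n) for j in range(i + 1, n))
    if not scratch:
        raise ValueError("need at least two points")
    # Assumed rule: take the distance at the requested fraction of the sorted list.
    k = min(int(fraction * len(scratch)), len(scratch) - 1)
    kernel_size = scratch[k]
    if kernel_size == 0.0:
        # Proposed fallback: the first non-zero distance in sorted order.
        for d in scratch:
            if d != 0.0:
                kernel_size = d
                break
    if kernel_size == 0.0:
        # No non-zero distance exists, so fail loudly instead of dividing by zero.
        raise ValueError("all pairwise distances are zero")
    return kernel_size


# Four duplicate points and one distinct point on a line.
points = [0.0, 0.0, 0.0, 0.0, 1.0]
distances = [[abs(a - b) for b in points] for a in points]
print(kernel_size_with_fallback(distances, 0.25))  # prints 1.0
```

Here six of the ten pairwise distances are zero, so the value picked at fraction 0.25 is zero and the fallback returns the first non-zero distance instead.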

(I don't think this scenario is too pathological, either: I came across it because I had duplicate points under the Euclidean metric, which may or may not make sense depending on the problem, but a user could also pass in a distance matrix generated by some function that, for example, treats inputs within some cutoff as identical, etc.)

Thanks,
A.M.
