cwehmeyer / pydpc
Clustering by fast search and find of density peaks
License: GNU Lesser General Public License v3.0
With some parameters, the method assign() crashes with a segmentation fault (signal 11).
The error occurs in this line:
self.halo_idx, self.core_idx = _core.get_halo(
self.density, self.membership,
self.border_density, self.border_member.astype(_np.intc), border_only=border_only)
After commenting out this call, it works fine.
My question is: can I comment out the call without distorting the clustering result?
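For context on what that call computes, here is a minimal pure-Python sketch of the halo/core split in density-peak clustering (standard DPC semantics; split_halo_core is a hypothetical helper written for illustration, not pydpc's actual API). If the call is skipped, no halo is computed, so every point is effectively treated as core, which does change the result for points below their cluster's border density.

```python
import numpy as np

# Hypothetical sketch of the halo/core split in density-peak clustering
# (Rodriguez & Laio, 2014). Names mirror the pydpc attributes quoted
# above, but this is NOT the library's actual implementation.
def split_halo_core(density, membership, border_density):
    # border_density[c] is the highest density on the border of cluster c;
    # points whose density falls below it belong to the halo.
    is_halo = density < border_density[membership]
    halo_idx = np.where(is_halo)[0]
    core_idx = np.where(~is_halo)[0]
    return halo_idx, core_idx

density = np.array([5.0, 1.0, 4.0, 0.5])
membership = np.array([0, 0, 1, 1])
border_density = np.array([2.0, 1.0])
halo, core = split_halo_core(density, membership, border_density)
```

Here points 1 and 3 fall below their cluster's border density and land in the halo, while points 0 and 2 remain core.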
Hi Dev Team, thanks for the package; it's one of my favorites, since your implementation is straightforward and even includes some improvements over the paper.
My issue is that I recently used the package to cluster a large corpus of text (the tf-idf of the corpus). My compute instance has 1.5 TB of RAM, so there is definitely room for memory.
But if I pass an array/dataframe with more than 50k observations, the program shuts down (usually at the "self.kernel_size = _core.get_kernel_size(self.distances, self.fraction" line). The data passed in is all numeric, and the distances are calculated with sklearn.metrics.pairwise.cosine_similarity.
When it shuts down, it gives no error message except:
"The python program has shutdown"
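For scale, a back-of-the-envelope estimate (my own arithmetic based on the 50k figure above, not a measurement of pydpc): a single dense float64 pairwise distance matrix at that size is already around 19 GB, and intermediate copies multiply it. That alone should still fit comfortably in 1.5 TB, which hints the crash may come from something other than raw memory exhaustion, though that is only a guess.

```python
# Back-of-the-envelope memory estimate for a dense float64 pairwise
# distance matrix at the problem size described above; the 50k figure
# comes from the report, the rest is plain arithmetic.
n = 50_000
bytes_per_float64 = 8
distance_matrix_gb = n * n * bytes_per_float64 / 1024 ** 3
print(round(distance_matrix_gb, 2))  # ~18.63 GB for a single copy
```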
Thanks for your contribution, but I can't find any documentation or tutorial for this package. Is there any way to get one?
Hi,
First off, thanks for making this optimized DPC implementation available -- it's a big help.
When a sufficient number (fraction * n) of off-diagonal zeros are present in the pairwise distance matrix, the kernel size can be set to zero, leading to division by zero and nonsense zeros and NaNs in the density.
This could be solved relatively simply by adding a fallback case to _get_kernel_size():
// ...
/* fall back to the first (smallest) non-zero distance in scratch */
if(kernel_size == 0.0) {
    for(int i = 0; i < n; i++) {
        if(scratch[i] != 0.0) {
            kernel_size = scratch[i];
            break;
        }
    }
}
Taking the first non-zero distance as the width of the Gaussian seems like a reasonable fallback assumption. In any case, if pydpc doesn't want to presume that this is reasonable, either the user should be given the ability to specify kernel_width, or an error should be thrown.
(I don't think this scenario is too pathological, either: I came across it because I had duplicate points under the Euclidean metric, which may or may not make sense depending on the problem, but a user could also pass in a distance matrix generated by some function that, for example, marks inputs within some cutoff as identical.)
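To make the failure mode concrete, here is a small NumPy sketch. The quantile-style kernel-size selection below is my own reconstruction for illustration, not pydpc's exact code; it shows duplicate points driving the kernel size to zero, and applies the same first-non-zero fallback proposed above, in Python.

```python
import numpy as np

# Three 2-D points with one exact duplicate, so two off-diagonal
# pairwise distances are exactly zero.
points = np.array([[0.0, 0.0], [0.0, 0.0], [1.0, 1.0]])
diff = points[:, None, :] - points[None, :, :]
distances = np.sqrt((diff ** 2).sum(axis=-1))

# Stand-in for the library's kernel-size selection (an assumption, not
# pydpc's exact code): pick a low quantile of the sorted off-diagonal
# distances. With enough zeros, that quantile is zero.
fraction = 0.2
offdiag = distances[~np.eye(len(points), dtype=bool)]
kernel_size = np.sort(offdiag)[int(fraction * offdiag.size)]  # 0.0 here

# Proposed fallback: take the smallest non-zero distance instead.
if kernel_size == 0.0:
    nonzero = offdiag[offdiag > 0.0]
    kernel_size = nonzero.min()  # sqrt(2) for these points
```

With the fallback, the Gaussian width is finite and the density computation no longer divides by zero.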
Thanks,
A.M.