alexdepremia / advanced-density-peaks Goto Github PK
View Code? Open in Web Editor NEWLicense: BSD 3-Clause "New" or "Revised" License
License: BSD 3-Clause "New" or "Revised" License
DPA carries on a density topography analisys by means of an unsupervised version of density peaks clustering (Rodriguez and Laio, Science, 2014). The code reads a standard input (few examples are provided in the Examples directory). Run the program in order to see an interactive description of the input. Please note that this is a a preliminary version: do not hesitate in comunicating errors/bugs in the code. The code performs the following operations: * Compute Euclidean distances (with/without periodic coordinates i.e. angles) or read distances computed with an external code. * Compute the intrinsic dimension of the data by using the TWO-NN algorithm (Facco et al , Sci. Rep. 2017). * Compute the logarithm of the density for each point and its error by the PAk approach (Rodriguez et al, 2018) or by the standard k-NN. * Extract information about the topography by analizing the border densities between clusters for distinguishing statistical fluctuations from real regions of high density (d'Errico, 2018, https://arxiv.org/abs/1802.10549). In its present implementation, the code uses one of the ORDERPACK modules written by Michel Olagnon (http://www.physics.rutgers.edu/~diablo/) Compiling can be done by executing the "./compile.sh" script. The executable appears in ./bin folder as "DPA.x". In the "examples" folders there are some toy cases. In standard linux machines they can be run with the "make_examples.sh" script and plotted (if you have installed gnuplot) with the "plot_examples.sh" script. The program reads as input files of three kinds: (1) A symetric distance file written as e1,e2,d(e1,e2). Element indexes must start from 1. Field separators accepted are space or tab. (2) A coordinates file, in which each row corresponds to a data point, each column to a coordinate. In this case the distance matrix is computed with the Euclidean metric (optionally periodic boundary conditions can be considered). Field separators accepted are space, comma, semicolon or tab. (3) A distance file written as e1,e2,d(e1,e2) with the nearest neighbors for each data point. Distances that are not specified are considered very large. Field separators accepted are space or tab. The ouput files are: * Point.info: Seven columns, (1) index of the point (2) k employed for computing the density at this point (3) distance from the k-th neighbor (4) logarithm of the density of the point (plus a constant factor) (5) Error of the logarithm of the density at the point (6) The cluster at which the point is assigned (7) The cluster at which the point is assigned, but 0 if the point belongs to the halo * Topography.info: Two sections: CENTERS: Six columns, (1) index of the cluster (2) logarithm of the density at the center of the cluster (3) Error of the logarithm of the density at the center of the cluster (4) Point index of the center of the cluster (5) Cluster population (6) Cluster population without halo points SADDLES: Four columns, (1 & 2) index of the clusters that define the saddle point (3) logarithm of the density of the saddle point (4) Error of the logarithm of the density at the saddle point * cluster.log: A simple log file about what is happening during execution. * Dendrogram_graph_2.dat & Dendrogram_graph_2.gpl: Gnuplot script for obtaining the dendrogram associated with the topography. * Topography.mat: Corresponds to the Topography matrix, in which the diagonal entries are the heights of the peaks and the off- diagonal entries are the heights of the saddle points separating these peaks. In this case, the peaks are sorted according with the Dendrogram. * mat_labels.dat: Allows easy plotting of the Topography matrix by the "plot_examples.sh" script. * Restart_summary.rst: Allows recomputing the cluster partition and the dendrogram (for instance with a different Z value) without recomputing/reading the distances nor recomputing the densities. For using it you need to add rst as first argument to DPA.x Moreover, if we are computing the dimension, additional files are generated. We strongly recomend to plot them before setting the dimension, checking if the results comply with the quality standards defined in ref (Facco, Sci. Rep. 2017): * xy_dim.dat & xy_dim.gpl: Gnuplot script to check the slope of the S-graph (view ref. Facco et al 2017). * decimation_plot.dat & decimation_plot.gpl: Gnuplot script to check the stability of the dimension estimation with respect to decimation. In the tools folder we provide: Mat2input.sh, a bash script for transforming distance matrix files to acceptable distance input files. Its usage is: ./Mat2input.sh matrix_file_name input_file_for_DPA Bibliography: 1) Rodriguez, A., & Laio, A. (2014). Clustering by fast search and find of density peaks. Science, 344(6191), 1492-1496. 2) Facco, E., d’Errico, M., Rodriguez, A., & Laio, A. (2017). Estimating the intrinsic dimension of datasets by a minimal neighborhood information. Scientific reports, 7(1), 12140. 3) Rodriguez, A. d'Errico, M., Facco, E. & Laio, A. (2018) Computing the free energy without collective variables. JCTC (in press). 4) d'Errico, M., Facco, E., Laio, A. & Rodriguez, A. (2018) Automatic topography of high-dimensional data sets by non-parametric Density Peak clustering (Submitted, https://arxiv.org/abs/1802.10549).
I'm unsure about the part of the code in which borders between clusters are identified (lines 1978-2013)
#!fortran90
do i=1,Nele
if (.not.centquest(i)) then
dmin=maxr
do j=1,Nele
if (cluster(j).ne.cluster(i)) then
d=gDist(i,j)
if (d.lt.dmin) then
dmin=d
jg=j
endif
endif
enddo
if (dmin.le.dc(i)) then
iref=i
k=0
extend=.true.
do while ( (k.lt.Nele).and.extend)
k=k+1
if (cluster(k).eq.cluster(i)) then
if (gDist(k,jg).lt.dmin) extend=.false.
endif
enddo
if (extend) then
candidates_B(i)=cluster(jg)
if (Rho(iref).gt. Bord(cluster(i),cluster(jg))) then
Bord(cluster(i),cluster(jg))=Rho(iref)
Bord(cluster(jg),cluster(i))=Rho(iref)
Bord_err(cluster(i),cluster(jg))=Rho_err(iref)
Bord_err(cluster(jg),cluster(i))=Rho_err(iref)
eb(cluster(i),cluster(jg))=iref
eb(cluster(jg),cluster(i))=iref
endif
endif
endif
endif
enddo
Specifically, the first inner loop
#!fortran90
do j=1,Nele
if (cluster(j).ne.cluster(i)) then
d=gDist(i,j)
if (d.lt.dmin) then
dmin=d
jg=j
endif
endif
enddo
given a point i, will find its nearest neighbour not belonging to cluster(i), which will be labelled ig. Then, the following step is to check if there's any point in cluster(i) closer to ig.
This is similar but not equal to what I read in the arxiv manuscript:
"A point i, belonging to cluster c, is assumed to be at the border between cluster c and c' if its closest point j belonging to c' is within a distance r_k_i and if i is the closest point to j among those belonging to c"
In particular, according to the manuscript definition, given cluster c, and point i, one should consider each of the other clusters, and find the point closest to i for each of the other clusters, then do the second check on each of them.
Am I interpreting any of this wrong?
Sorry if there's something obvious that I'm missing
I want to use other datasets. How can I get the file with the suffix ".in" in the folder "examples"?
Thank you for your help!
Sometimes, the dendrogram generation goes in an infinite loop, generating a huge gpl file. Probably we should program again the subroutine graph2 (and simplify it).
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.