alexdepremia / advanced-density-peaks Goto Github PK

License: BSD 3-Clause "New" or "Revised" License

Shell 0.31% Fortran 99.69%

advanced-density-peaks's Introduction

DPA carries on a density topography analisys by means of an unsupervised version of  density peaks clustering (Rodriguez and Laio, Science, 2014). 
The code reads a standard input (few examples are provided in the Examples directory).  
Run the program in order to see an interactive description of the input.
Please note that this is a a preliminary version: do not hesitate in comunicating errors/bugs in the code.
The code performs the following operations:

  * Compute Euclidean distances (with/without periodic coordinates i.e. angles) or read distances computed with an external code.
  * Compute the intrinsic dimension of the data by using the TWO-NN algorithm (Facco et al , Sci. Rep.  2017).
  * Compute the logarithm of the density for each point and its error by the PAk approach (Rodriguez et al, 2018) or by the standard k-NN.
  * Extract information about the topography by analizing the border densities between clusters for distinguishing 
    statistical fluctuations from real regions of high density (d'Errico, 2018, https://arxiv.org/abs/1802.10549).

In its present implementation, the code uses one of the ORDERPACK modules written by Michel Olagnon (http://www.physics.rutgers.edu/~diablo/)

Compiling can be done by executing the "./compile.sh" script. The executable appears in ./bin folder as "DPA.x".

In the "examples" folders there are some toy cases. In standard linux machines they can be run with the
"make_examples.sh" script and plotted (if you have installed gnuplot) with the "plot_examples.sh" script.

The program reads as input files of three kinds:
                               (1) A symetric distance file written as e1,e2,d(e1,e2). Element indexes must start from 1. Field separators accepted are space or tab.
                               (2) A coordinates file, in which each row corresponds to a data point, each column to a coordinate.
                                   In this case the distance matrix is computed with the Euclidean metric (optionally periodic 
                                   boundary conditions can be considered). Field separators accepted are space, comma, semicolon or tab.
                               (3) A distance file written as e1,e2,d(e1,e2) with the nearest neighbors for each data point.
                                   Distances that are not specified are considered very large. Field separators accepted are space or tab.

The ouput files are:

  * Point.info: Seven columns, (1) index of the point
                               (2) k employed for computing the density at this point
                               (3) distance from the k-th  neighbor
                               (4) logarithm of the density of the point (plus a constant factor)
                               (5) Error of the logarithm of the density at the point 
                               (6) The cluster at which the point is assigned
                               (7) The cluster at which the point is assigned, but 0 if the point belongs to the halo

  * Topography.info: Two sections:

                     CENTERS: Six columns,
                             (1) index of the cluster
                             (2) logarithm of the density at the center of the cluster
                             (3) Error of the logarithm of the density at the center of the cluster
                             (4) Point index of the center of the cluster
                             (5) Cluster population
                             (6) Cluster population without halo points

                     SADDLES: Four columns,
                             (1 & 2) index of the clusters that define the saddle point
                             (3) logarithm of the density of the saddle point
                             (4) Error of the logarithm of the density at the saddle point 

  * cluster.log: A simple log file about what is happening during execution.

  * Dendrogram_graph_2.dat & Dendrogram_graph_2.gpl: Gnuplot script for obtaining the dendrogram associated with the topography.

  * Topography.mat: Corresponds to the Topography matrix, in which the diagonal entries are the heights of the peaks and the off-
diagonal entries are the heights of the saddle points separating these peaks. In this case, the peaks are sorted according with the
Dendrogram.

  * mat_labels.dat: Allows easy plotting of the Topography matrix by the "plot_examples.sh" script.
  
  * Restart_summary.rst: Allows recomputing the cluster partition and the dendrogram (for instance with a different Z value) without recomputing/reading the distances nor recomputing the densities. For using it you need to add rst as first argument to DPA.x

Moreover, if we are computing the dimension, additional files are generated.  We strongly recomend to plot them before 
setting the dimension, checking if the results comply with the quality standards defined in ref (Facco, Sci. Rep. 2017):

  * xy_dim.dat & xy_dim.gpl: Gnuplot script to check the slope of the S-graph (view ref. Facco et al 2017).

  * decimation_plot.dat & decimation_plot.gpl:  Gnuplot script to check the stability of the dimension estimation with respect to  decimation.

In the tools folder we provide:

Mat2input.sh, a bash script for transforming distance matrix files to acceptable distance input files. Its usage is:

  ./Mat2input.sh matrix_file_name input_file_for_DPA


Bibliography:

  1) Rodriguez, A., & Laio, A. (2014). Clustering by fast search and find of 
  density peaks. Science, 344(6191), 1492-1496.

  2) Facco, E., d’Errico, M., Rodriguez, A., & Laio, A. (2017). Estimating the
  intrinsic dimension of datasets by a minimal neighborhood information.
  Scientific reports, 7(1), 12140.

  3) Rodriguez, A. d'Errico, M., Facco, E. & Laio, A. (2018) Computing the free
  energy without collective variables. JCTC (in press).

  4) d'Errico, M., Facco, E., Laio, A. & Rodriguez, A. (2018) Automatic
  topography of high-dimensional data sets by non-parametric Density Peak
  clustering (Submitted, https://arxiv.org/abs/1802.10549).

advanced-density-peaks's People

Contributors

Stargazers

Watchers

Forkers

alishakiba nkm-ml hnymzrl

advanced-density-peaks's Issues

Borders identification

I'm unsure about the part of the code in which borders between clusters are identified (lines 1978-2013)

#!fortran90

do i=1,Nele
  if (.not.centquest(i)) then
    dmin=maxr
    do j=1,Nele
      if (cluster(j).ne.cluster(i)) then
        d=gDist(i,j)
        if (d.lt.dmin) then
          dmin=d
          jg=j
        endif
      endif
    enddo
    if (dmin.le.dc(i)) then
      iref=i
      k=0
      extend=.true.
      do while ( (k.lt.Nele).and.extend)
        k=k+1
        if (cluster(k).eq.cluster(i)) then
          if (gDist(k,jg).lt.dmin) extend=.false.
        endif
      enddo
      if (extend) then
        candidates_B(i)=cluster(jg)
        if (Rho(iref).gt. Bord(cluster(i),cluster(jg))) then
          Bord(cluster(i),cluster(jg))=Rho(iref)
          Bord(cluster(jg),cluster(i))=Rho(iref)
          Bord_err(cluster(i),cluster(jg))=Rho_err(iref)
          Bord_err(cluster(jg),cluster(i))=Rho_err(iref)
          eb(cluster(i),cluster(jg))=iref
          eb(cluster(jg),cluster(i))=iref
        endif
      endif
    endif
  endif
enddo

Specifically, the first inner loop

#!fortran90

do j=1,Nele
      if (cluster(j).ne.cluster(i)) then
        d=gDist(i,j)
        if (d.lt.dmin) then
          dmin=d
          jg=j
        endif
      endif
    enddo

given a point i, will find its nearest neighbour not belonging to cluster(i), which will be labelled ig. Then, the following step is to check if there's any point in cluster(i) closer to ig.
This is similar but not equal to what I read in the arxiv manuscript:

"A point i, belonging to cluster c, is assumed to be at the border between cluster c and c' if its closest point j belonging to c' is within a distance r_k_i and if i is the closest point to j among those belonging to c"

In particular, according to the manuscript definition, given cluster c, and point i, one should consider each of the other clusters, and find the point closest to i for each of the other clusters, then do the second check on each of them.

Am I interpreting any of this wrong?
Sorry if there's something obvious that I'm missing

How can I get the files with the suffix ".in" in the folder "examples" from other datasets?

I want to use other datasets. How can I get the file with the suffix ".in" in the folder "examples"?
Thank you for your help!

Problem in generation of the dendrogram

Sometimes, the dendrogram generation goes in an infinite loop, generating a huge gpl file. Probably we should program again the subroutine graph2 (and simplify it).

Recommend Projects

React

A declarative, efficient, and flexible JavaScript library for building user interfaces.
Vue.js

🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
Typescript

TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
TensorFlow

An Open Source Machine Learning Framework for Everyone
Django

The Web framework for perfectionists with deadlines.
Laravel

A PHP framework for web artisans
D3

Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

javascript

JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
web

Some thing interesting about web. New door for the world.
server

A server is a program made to process requests and deliver data to clients.
Machine learning

Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Visualization

Some thing interesting about visualization, use data art
Game

Some thing interesting about game, make everyone happy.

Recommend Org

Facebook

We are working to build community through open source technology. NB: members must have two-factor auth.
Microsoft

Open source projects and samples from Microsoft.
Google

Google ❤️ Open Source for everyone.
Alibaba

Alibaba Open Source for everyone
D3

Data-Driven Documents codes.
Tencent

China tencent open source team.