Git Product home page Git Product logo

clustertree's Introduction

clustertree

CRAN LifeCycle Status Travis-CI Build Status AppVeyor Build Status

clustertree is a fast and extensible R package for estimating the empirical cluster tree, a hierarchical representation of high-density clusters, defined (recursively) as the connected subsets of:

From a high-level perspective, the cluster tree provides a highly interpretable, multi-resolution, hierarchical summary of the underlying density of a [finite] sample. The package includes both tools and complete implementations of the following list of estimators:
  1. The Robust Single Linkage (RSL) algorithm from:

Chaudhuri, Kamalika, and Sanjoy Dasgupta. "Rates of convergence for the cluster tree." Advances in Neural Information Processing Systems. 2010.

  1. The KNN and Mutual KNN variants from Algorithm 2 analyzed in:

Chaudhuri, Kamalika, et al. "Consistent procedures for cluster tree estimation and pruning." IEEE Transactions on Information Theory 60.12 (2014): 7900-7912.


Additional tools for working with the tree(s) are also provided. Namely,


  1. Stuetzle's Runt pruning technique is provided, from:

Stuetzle, Werner. "Estimating the cluster tree of a density by analyzing the minimal spanning tree of a sample." Journal of classification 20.1 (2003): 025-047.

  1. Eldridge et. al's Merge Distortion metric is also provided, from:

Eldridge, Justin, Mikhail Belkin, and Yusu Wang. "Beyond hartigan consistency: Merge distortion metric for hierarchical clustering." Conference on Learning Theory. 2015.


The tree(s) returned are by default standard hclust objects. Support for densities estimated via KDE techniques are planned for the future. For more information regarding the utility of this package and of the cluster tree itself, see the Usage and Additional References sections, respectively.

Installation

The package currently only exists on github. The installation options are as follows:

  1. Installed directly from the repo with help from the devtools package, i.e.

    install.packages("devtools")
    devtools::install_github("peekxc/clustertree")
  2. Download the most recent successful build from AppVeyor

Development note

The package is actively developing. A release candidate for CRAN will be created eventually.

Usage

library("clustertree")

data("iris")
x <- as.matrix(iris[, 1:4])

Run Robust Single Linkage

ct <- clustertree(x, k = 15L, alpha = sqrt(2), estimator = "RSL")
ct
Cluster tree object of: 150 objects.
Estimator used: Robust Single Linkage
Parameters: k = 15, alpha = 1.4142, dim = 4

Plot the cluster tree. Like other hierarchical clustering algorithms, the main tree is stored internally as an 'hclust' object

plot(ct)
is(ct$hc, "hclust") 

TRUE

You can also use either of the two linkage criteria studied From Algorithm 2 in [2] listed above:

ct2 <- clustertree(x, k = 15L, alpha = sqrt(2), estimator = "KNN")
ct3 <- clustertree(x, k = 15L, alpha = sqrt(2), estimator = "mutual KNN")

Unlike other hierarchical algorithms, it's possible that these do not form complete hierarchies. This can happen when there is a sufficiently low density areas separating high density clusters. For example, it's well known the Iris setosa species is clearly separable from the other two species. This is reflected in the Mutual KNN graph, the sparser of the two estimators. These disjoint connected components are also stored as trees, i.e.

Iris Setosa tree

length(ct3$hc) ## == 2
ct3$hc[[1]]
...
Cluster method   : mutual knn 
Number of objects: 50 
ct3$hc[[2]]
...
Cluster method   : mutual knn 
Number of objects: 100 

Typical hierarchical clustering structures represent every singleton as a possible cluster, but obviously, not every singleton will be in a disjoint high density cluster. One approach to making these modes more apparent is to specify a threshold to to use as a means of 'runt pruning' from [3] above.

This can significantly simplify the tree:

  ct_simplified <- runt_prune(ct, 2)
  plot(ct_simplified[[1]]) ## Three detected modes of density

Additional References

The cluster tree theory itself has a long history. For a brief overview of the definition, see section 11.13 of:

Hartigan, John A., and J. A. Hartigan. Clustering algorithms. Vol. 209. New York: Wiley, 1975. for an overview of what Hartigan refers to as the density-contour tree, and brief overview of the background and associated concepts of formal high-density clustering.

Established notions of consistency of the above class of estimators can be found in:

Hartigan, John A. "Consistency of single linkage for high-density clusters." Journal of the American Statistical Association 76.374 (1981): 388-394.

Acknowledgements

This package is part of the Google Summer of Code 2017 initiative, under the R Project.

clustertree's People

Contributors

peekxc avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.