clustrviz's Introduction

License: GPL v3. Project Status: Active. The project has reached a stable, usable state and is being actively developed.

clustRviz

clustRviz aims to enable fast computation and easy visualization of Convex Clustering solution paths.

Installation

You can install clustRviz from GitHub with:

# install.packages("devtools")
devtools::install_github("DataSlingers/clustRviz")

Note that RcppEigen (which clustRviz uses internally) triggers many compiler warnings (which cannot be suppressed per CRAN policies). Many of these warnings can be suppressed locally by adding the line CXX11FLAGS+=-Wno-ignored-attributes to your ~/.R/Makevars file. To install an R package from source, you will need suitable development tools installed, including a C++ compiler and potentially a Fortran runtime. Details about these toolchains are available on CRAN for Windows and macOS.

Quick-Start

There are two main entry points to the clustRviz package, the CARP and CBASS functions, which perform convex clustering and convex biclustering respectively. We demonstrate the use of these two functions on a text mining data set, presidential_speech, which measures how often the 44 U.S. presidents used certain words in their public addresses.

library(clustRviz)
#> Registered S3 method overwritten by 'seriation':
#>   method         from 
#>   reorder.hclust gclus
data(presidential_speech)
presidential_speech[1:6, 1:6]
#>                     amount appropri  british     cent commerci commission
#> Abraham Lincoln   3.433987 2.397895 1.791759 2.564949 2.708050   2.079442
#> Andrew Jackson    4.248495 4.663439 2.995732 1.945910 3.828641   3.218876
#> Andrew Johnson    4.025352 3.091042 2.833213 3.332205 2.772589   2.079442
#> Barack Obama      1.386294 0.000000 0.000000 1.386294 0.000000   0.000000
#> Benjamin Harrison 4.060443 4.174387 2.302585 4.304065 3.663562   3.465736
#> Calvin Coolidge   3.713572 4.094345 1.386294 3.555348 2.639057   1.609438

Clustering

We begin by clustering this data set, grouping the rows (presidents) into clusters:

carp_fit <- CARP(presidential_speech)
#> Pre-computing weights and edge sets
#> Computing Convex Clustering [CARP] Path
#> Post-processing
print(carp_fit)
#> CARP Fit Summary
#> ====================
#> 
#> Algorithm: CARP (t = 1.05) 
#> Fit Time: 0.142 secs 
#> Total Time: 0.636 secs 
#> 
#> Number of Observations: 44 
#> Number of Variables:    75 
#> 
#> Pre-processing options:
#>  - Columnwise centering: TRUE 
#>  - Columnwise scaling:   FALSE 
#> 
#> Weights:
#>  - Source: Radial Basis Function Kernel Weights
#>  - Distance Metric: Euclidean
#>  - Scale parameter (phi): 0.01 [Data-Driven]
#>  - Sparsified: 4 Nearest Neighbors [Data-Driven]

The algorithmic regularization technique employed by CARP makes computation of the whole solution path almost immediate.

We can examine the result of CARP graphically. We begin with a standard dendrogram, with three clusters highlighted:

plot(carp_fit, type = "dendrogram", k = 3)

Examining the dendrogram, we see two clear clusters, consisting of pre-WWII and post-WWII presidents, and Warren G. Harding as a possible outlier. Harding is generally considered one of the worst US presidents of all time, so this is perhaps not too surprising.

A more interesting visualization is the dynamic path visualization, whereby we can watch the clusters fuse as the regularization level is increased:

plot(carp_fit, type = "path", dynamic = TRUE)

BiClustering

The use of CBASS for convex biclustering is similar, and we demonstrate it here with a cluster heatmap, with the regularization set to give 3 observation clusters:

cbass_fit <- CBASS(presidential_speech)
#> Pre-computing column weights and edge sets
#> Pre-computing row weights and edge sets
#> Computing Convex Bi-Clustering [CBASS] Path
#> Post-processing rows
#> Post-processing columns
plot(cbass_fit, k.row = 3)

By default, plotting the result of CBASS gives the traditional cluster heatmap, but we can also get the row or column dendrograms:

plot(cbass_fit, type = "row.dendrogram", k.row = 3)

By default, if a regularization level is specified, all plotting functions in clustRviz will plot the clustered data. If the regularization level is not specified, the raw data will be plotted instead:

plot(cbass_fit, type = "heatmap")

More details about the use and mathematical formulation of CARP and CBASS may be found in the package documentation.

clustrviz's People

Contributors

jasminezhuoy, jjn13, kenzeng24, michaelweylandt

clustrviz's Issues

Introductory Vignette

Add vignettes that introduce convex clustering / bi-clustering and compare them to other methods. (These can be adapted from the original papers on the subject.)

tSNE-Based Weights

Add "tSNE" and "preSNE"-type weights.

As originally discussed in #7, @tianyibourne and @minjiewang1991 have had some success with alternate weighting schemes.

Code to Create `presidential_speech` data set

Add code to (re)create presidential_speech data set. This will allow the data set to be re-run as more speeches are recorded. (If we update the "official" version in the package, we may need to version it somehow.)

CARP crash

Running into the following:

> carp_fit <- CARP(tf, labels = track_features$track_uri)
X should be a matrix, not a data.frame. Converting with as.matrix(). (Called from CARP)
Pre-computing weights and edge sets
Error in if (num_connected_old == num_connected) { : 
  missing value where TRUE/FALSE needed

Also, it's taking ~10 minutes just to get to this error on a 2600 by 11 matrix -- is that expected?

Can send data for a reprex if necessary.

Error in dynamic path visualization when following tutorial:

I was following the example from the quickstart and got the following error:

>plot(carp_fit, type = "dynamic_path")

nframes and fps adjusted to match transition
geom_path: Each group consists of only one observation. Do you need to adjust the group aesthetic?
geom_path: Each group consists of only one observation. Do you need to adjust the group aesthetic?
Error in magick::image_animate(anim, fps, loop = if (loop) 0 else 1) :                                                  
  argument 'fps' must be a factor of 100

[Edited by MW: Added triple-backticks for code formatting]

I don't know if this is an issue with my local configuration, but I thought I should mention it.

Hardcoded paths to strip and uname

In Makevars, strip and uname are assumed to be in /usr/bin and /bin respectively. This is not the case on some systems (e.g., NixOS), which then require patching to remove the hardcoded paths.

U Smoothing

Add a post-processing step to "smooth" U by replacing the U-hat for clustered elements with the mean U-hat for elements in that cluster. This won't change the implied clusterings, but it will improve some minor problems with the path graphics, where the centroids do not exactly coincide.

(Note that this isn't a real CARP / CBASS issue but rather a deeper issue with running the ADMM for a finite amount of time on these problems: similar to how interior point methods will never get exact zeros even when run to essentially numerical convergence on the lasso and need a final thresholding step.)

Once this is done, we can speed up plotting by only showing the distinct path elements instead of every path element. The plot_frame elements passed to ggplot can go through a dplyr::distinct first to speed things up / avoid massive over-plotting. This should alleviate some of the slowness issues at the heart of #56.

TODO:

  • Add CARP U smoothing
  • Add CBASS U smoothing
  • Update get_cluster_centroids and get_clustered_data
  • Speed up path plotting
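
The smoothing step described above can be sketched as follows. This is a minimal illustration and not the package's implementation: the name smooth_by_cluster, the row-major std::vector storage, and the integer labels vector are all assumptions made for the example (the package itself works with Eigen matrices).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major stand-in for U-hat (hypothetical; the package uses Eigen types).
using Matrix = std::vector<std::vector<double>>;

// Replace every row of U by the mean of the rows that share its cluster label.
// labels[i] is assumed to lie in [0, n_clusters).
inline Matrix smooth_by_cluster(const Matrix& U,
                                const std::vector<std::size_t>& labels,
                                std::size_t n_clusters) {
    assert(U.size() == labels.size());
    const std::size_t p = U.empty() ? 0 : U[0].size();

    Matrix sums(n_clusters, std::vector<double>(p, 0.0));
    std::vector<std::size_t> counts(n_clusters, 0);

    // Accumulate per-cluster column sums and cluster sizes.
    for (std::size_t i = 0; i < U.size(); ++i) {
        assert(labels[i] < n_clusters);
        ++counts[labels[i]];
        for (std::size_t j = 0; j < p; ++j) sums[labels[i]][j] += U[i][j];
    }

    // Overwrite each row with its cluster centroid.
    Matrix out(U.size(), std::vector<double>(p, 0.0));
    for (std::size_t i = 0; i < U.size(); ++i) {
        const std::size_t c = labels[i];
        for (std::size_t j = 0; j < p; ++j) {
            out[i][j] = sums[c][j] / static_cast<double>(counts[c]);
        }
    }
    return out;
}
```

After this step, rows belonging to the same cluster are identical, which is what would let a later deduplication pass (such as the dplyr::distinct idea above) collapse them.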

Visual Tests

We need real tests (not just "runs without error" tests) for graphics functions. It looks like the vdiffr package is the best way to do so. I'll work on adding some for the static graphics after I finish #43.

@jjn13 Any ideas on testing the results of the interactive / dynamic parts?

Remove `static` and `interactive` flags

The static and interactive flags to CARP and CBASS were originally added to speed up fitting for cases when plotting would not be needed. If we can make these calculations cheap (vs. the actual clustering), there's no reason to have these flags and we can simplify the code base significantly.

Now (2018-09-12), the remaining culprit appears to be the PCA-projection in CARP (used to visualize cluster paths).

Add Clustering & Bi-Clustering Performance Benchmarks

We should add some benchmarks indicating the advantages of convex (bi)clustering over classical methods. Depending on the run-time, these can be a vignette or just included as a "run for yourself" file.

@jjn13 I believe you had some of these from early experiments. Could you share? If you email them to me, I'll work them up into a vignette.

Handling Well-Separated Observations

When a data point is far from the bulk, e^(-d^2) can underflow to zero for all (potential) neighbors yielding an unconnected graph. While this is numerically a problem, it's not conceptually one so we should handle it better than just throwing an error.

@agenevera suggested adding a small epsilon nugget to the weights (possibly to all, or at least to all the zero weights) in the dense calculations. This would still be mostly sparsified away, but would fix connectedness issues.

(See prior discussion and reproducible example in #54)
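
To make the underflow concrete, here is a small sketch of the problem and of the nugget idea. The weight form exp(-phi * d^2), the function names, and the default nugget size are assumptions for illustration; they are not taken from the package source.

```cpp
#include <cassert>
#include <cmath>

// Plain RBF-style weight exp(-phi * d^2). For a far-away point the exponent is
// so negative that the result underflows to exactly 0.0 in double precision.
inline double rbf_weight(double dist, double phi) {
    assert(phi > 0.0);
    return std::exp(-phi * dist * dist);
}

// Hypothetical variant with a small additive nugget, so that no weight is
// exactly zero and the dense weight graph stays connected before sparsification.
inline double rbf_weight_nugget(double dist, double phi, double eps = 1e-12) {
    assert(eps > 0.0);
    return rbf_weight(dist, phi) + eps;
}
```

With dist = 40 and phi = 1 the exponent is -1600, far below the point (roughly -745) at which exp underflows in double precision, so the plain weight is exactly zero while the nugget version stays positive.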

AppVeyor improvements

Two medium-/long-term TODOs:

  • The README AppVeyor badge is showing failure even when the latest build is successful
  • Can we use AppVeyor to test PRs as well?

Improve C++ Hygiene

A user has reported issues with compilation on Windows that appear to result from the lack of a system typedef for uint. (Windows provides typedef unsigned int UINT instead.) Possible fixes are being tried out in my mw/fix_uint branch.

This is a good opportunity to check general int and signed/unsigned hygiene. See also DataSlingers/ExclusiveLasso#12.
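
A sketch of the kind of hygiene meant here, under the assumption that the fix is to stop relying on uint altogether: use a standard fixed-width type and make signed/unsigned comparisons explicit. The alias and helper names are hypothetical and do not come from the mw/fix_uint branch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Portable replacement for the non-standard `uint` typedef.
using uint_type = std::uint32_t;

// Bounds check that avoids the implicit signed -> unsigned conversion:
// writing `i < n` directly would turn i = -1 into a huge unsigned value.
inline bool index_in_range(int i, std::size_t n) {
    if (i < 0) return false;
    return static_cast<std::size_t>(i) < n;
}

// Convert a validated signed index to an unsigned one.
inline std::size_t to_index(int i, std::size_t n) {
    assert(index_in_range(i, n));
    return static_cast<std::size_t>(i);
}
```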

clustering.CBASS returns values on the transformed data

I would expect this to be true:

cbass_fit <- CBASS(presidential_speech)
all(clustering(cbass_fit, percent = 0)$cluster.mean.matrix == t(presidential_speech))
#> [1] FALSE

(Note that the t() shouldn't be there either as noted in #9)

Increase Number of Frames in Plots

By default, gif export and shiny app plot a relatively small number of frames (21 I think), resulting in a pretty choppy "movie experience" -- let's increase this.

Path plots for `CBASS`

I'd like to be able to create CARP-style path plots for both the observation and feature clustering performed by CBASS.

Progress Updates

Add dynamic output to assure users that progress is being made. I don't know how to do an "honest" progress bar for this problem, so something of the form

[SPINNER] Number of Edges Fused: E / E_max. Percent Variance Remaining: V_norm / V_norm_0. Iterations Performed: I. Current Gamma: G.

(Perhaps not all of that if it's too long - probably clip to 80 characters)
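
One way the status line could be assembled and clipped is sketched below. The function name, the parameters, and the exact wording of the line are assumptions; the spinner is left out, and in the package the string would presumably be emitted through R's console output rather than returned.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

// Build one status line in roughly the format proposed above and clip it to
// max_width characters. All parameter names are hypothetical.
inline std::string progress_line(std::size_t fused, std::size_t max_edges,
                                 double var_remaining, std::size_t iter,
                                 double gamma, std::size_t max_width = 80) {
    assert(max_width > 0);
    char buf[256];
    std::snprintf(buf, sizeof(buf),
                  "Edges fused: %zu / %zu. Variance remaining: %.1f%%. "
                  "Iterations: %zu. Gamma: %.4g.",
                  fused, max_edges, 100.0 * var_remaining, iter, gamma);
    std::string line(buf);
    if (line.size() > max_width) line.resize(max_width);  // hard clip
    return line;
}
```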

Fix CI Segfault in Rendering CARP Plots on Ubuntu

The Ubuntu test-bed on GitHub Actions trips a segfault when trying to render CARP plots under R 3.6.3. This doesn't seem to be a problem in our code, but it makes our CI history much "dirtier" than it should be and slows down PR reviews.

Update Builds

It seems GH can only serve GH pages from master branch so rename branches:

  • stable --> master
  • master --> develop

Also need to update TravisCI scripts

"Tidy" Accessors

Add as.data.frame and similar accessors for programmatic usage of clustering solutions.

Change `CBASS` `obs` and `var` to `row` and `col` consistently

@agenevera wants us to change "obs" and "var" to "row" and "col" throughout since the notion of "observation" and "feature" might not apply wherever bi-clustering is used. This is not a hard change to make, but we should do it after we decide whether to use $X$ or $X^T$ internally throughout, since working with $X^T$ will make it very hard to keep "row" and "col" clear in the code.

Let's wait until the technical documentation (#28) is complete before deciding how to proceed.

clustering.CBASS returns a transposed data matrix

clustering.CBASS returns a transposed data matrix, contrary to the documentation:

cbass_fit <- CBASS(presidential_speech)
dim(clustering(cbass_fit, percent = 0.5))
#> [1] 75 44

dim(presidential_speech)
#> [1] 44 75

Tighten alpha bound for biclustering

The current alpha bound works in practice, but it is not as tight as theoretically possible. However, tighter bounds seemed to result in periodic failure to converge. Need to determine the right alpha to get performance guarantees.

Increase Frequency of Interrupt Checks

For some long running operations, I was unable to stop the calculations once they started. Might want to check that Rcpp::checkUserInterrupt() gets called semi-frequently in the main looping operations as recommended in the Best Practices section of R Packages.
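
The pattern being asked for is a periodic check inside the main loop. In the package the check would be the Rcpp::checkUserInterrupt() call named above; the sketch below substitutes a std::function so that it compiles without R, and the function name and check interval are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>

// Run n_iter iterations of `step`, calling `check_interrupt` once every
// `check_every` iterations. Returns how many checks were performed.
// In an Rcpp build, check_interrupt would wrap Rcpp::checkUserInterrupt().
inline std::size_t run_with_interrupt_checks(
        std::size_t n_iter, std::size_t check_every,
        const std::function<void()>& step,
        const std::function<void()>& check_interrupt) {
    assert(check_every > 0);
    std::size_t checks = 0;
    for (std::size_t k = 0; k < n_iter; ++k) {
        if (k % check_every == 0) {
            check_interrupt();
            ++checks;
        }
        step();
    }
    return checks;
}
```

Checking every iteration would be safest for responsiveness but adds overhead to cheap iterations, so the interval is left as a tunable parameter here.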

Technical Documentation

Add technical vignette which describes the vectorized formulation of CARP / CBASS in breathtaking detail.

Code Clean-Ups

Various things to clean-up after #112:

  • Various signed / unsigned comparison issues flagged by compiler. (Addressed by #122)

  • User configurable stopping parameters for exact = TRUE mode. Specifically, the convergence threshold and maximum iterations should be settable via clustRviz_options. (Addressed by #122)

  • reset_aux is now malignant and should be removed to increase performance of exact = TRUE. (Addressed by #117)

  • Calls to full_admm_step are now redundant and can be elided. (Addressed by #117)

  • The gamma values used in exact = TRUE and exact = FALSE don't line up. (See script in comments on #112)

  • Stopping criteria for clustering_impl.h and biclustering_impl.h don't line up. (Addressed by #117 and #122)

  • We should split MatrixProx into MatrixRowProx (current) and MatrixColProx to avoid unnecessary transposes: (Addressed by #117)

Eigen::MatrixXd MatrixColProx(const Eigen::MatrixXd& X,
                              double lambda,
                              const Eigen::VectorXd& weights,
                              bool l1 = true){
  Eigen::Index n = X.rows();
  Eigen::Index p = X.cols();

  Eigen::MatrixXd V(n, p);

  if(l1){
    for(Eigen::Index i = 0; i < n; i++){
      for(Eigen::Index j = 0; j < p; j++){
        V(i, j) = soft_thresh(X(i, j), lambda * weights(j));
      }
    }
  } else {
    for(Eigen::Index j = 0; j < p; j++){
      Eigen::VectorXd X_j = X.col(j);
      double scale_factor = 1 - lambda * weights(j) / X_j.norm();

      if(scale_factor > 0){
        V.col(j) = X_j * scale_factor;
      } else {
        V.col(j).setZero();
      }
    }
  }

  return V;
}

  • @dansenglund should be added to the DESCRIPTION file as an author. (Addressed by #117)

Will add more as I think of them.
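
The MatrixColProx proposal above calls soft_thresh, whose definition is not shown. The sketch below assumes it is the standard scalar soft-thresholding operator, and adds a hypothetical group_scale helper that mirrors the scale_factor computation in the else branch with an explicit guard for a zero-norm column.

```cpp
#include <cassert>
#include <cmath>

// Scalar soft-thresholding: shrink x toward zero by `threshold`, and return
// exactly zero when |x| <= threshold. Assumed to match the package's helper.
inline double soft_thresh(double x, double threshold) {
    assert(threshold >= 0.0);
    if (x > threshold) return x - threshold;
    if (x < -threshold) return x + threshold;
    return 0.0;
}

// Hypothetical helper: multiplicative factor for group (l2) shrinkage of a
// column with Euclidean norm `norm`. A zero-norm column maps to factor 0.
inline double group_scale(double norm, double threshold) {
    assert(norm >= 0.0 && threshold >= 0.0);
    if (norm == 0.0) return 0.0;
    const double s = 1.0 - threshold / norm;
    return s > 0.0 ? s : 0.0;
}
```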

Fails with Uniform Weights

This fails somewhere in the dendrogram reconstruction code:

CARP(presidential_speech, weights = rep(1, 44 * 43 / 2))

I've confirmed that this fails with a commit from before I touched the C++ code, so I don't think it's anything I did.

@jjn13 - Can you take a look? It might be in the back-tracking: alg.type="carp" seems to fix it.

Combine `carp.cpp` and `carp_viz.cpp`

carp.cpp and carp_viz.cpp are >90% the same, so they can probably be reasonably consolidated. (Same for cbass.cpp and cbass_viz.cpp) This will cut down compile time and make future maintenance easier.

Plain Non-Pathwise Solver

It would be good to expose a plain solver for convex (bi-)clustering at a single lambda value. Useful for package comparisons / timing stuff as well as refining solutions in graphics routines.

New gganimate api

I attempted to install the most recent version of gganimate (https://github.com/thomasp85/gganimate/) on one of my linux machines:

devtools::install_github('thomasp85/gganimate')

but this fails due to gifski dependency (https://github.com/r-rust/gifski). Trying to install this via:

devtools::install_github("r-rust/gifski")

asks me to install a Rust environment via:

sudo apt-get install cargo

which also fails, but can be fixed by adding an external repo.

This is not necessarily a no-go, but we would be requiring Windows/Linux/Mac users to have Rust; the situation then seems similar to the SuperLU difficulties we encountered w/ RcppArmadillo.

Just wanted to check if this is desirable before translating gganimate to the new API.

Formula interface

Please enable my laziness.

When a dataset has mixed data types, it can be a pain to pass an explicit matrix and then rejoin on that matrix down the line.

Proposals:

# assuming there's some nice column called labels

CARP(labels ~ Sepal.Width + Sepal.Length, data = iris)  

# or
CARP(formula = ~ Sepal.Length + Sepal.Width, data = iris, labels = ~ labels)
