Compute Convex (Bi)Clustering Solutions via Algorithmic Regularization

Home Page: https://DataSlingers.github.io/clustRviz/

License: GNU General Public License v3.0

R 79.93% C++ 19.86% C 0.21%

r rstats statistics regularization clustering convex-clustering algorithmic-regularization

clustrviz's Introduction

clustRviz

clustRviz aims to enable fast computation and easy visualization of Convex Clustering solution paths.

Installation

You can install clustRviz from GitHub with:

# install.packages("devtools")
devtools::install_github("DataSlingers/clustRviz")

Note that RcppEigen (which clustRviz uses internally) triggers many compiler warnings (which cannot be suppressed per CRAN policies). Many of these warnings can be locally suppressed by adding the line CXX11FLAGS+=-Wno-ignored-attributes to your ~/.R/Makevars file. To install an R package from source, you will need suitable development tools installed, including a C++ compiler and potentially a Fortran runtime. Details about these toolchains are available on CRAN for Windows and macOS.

Quick-Start

There are two main entry points to the clustRviz package, the CARP and CBASS functions, which perform convex clustering and convex biclustering respectively. We demonstrate the use of these two functions on a text mining data set, presidential_speech, which measures how often the 44 U.S. presidents used certain words in their public addresses.

library(clustRviz)
#> Registered S3 method overwritten by 'seriation':
#>   method         from 
#>   reorder.hclust gclus
data(presidential_speech)
presidential_speech[1:6, 1:6]
#>                     amount appropri  british     cent commerci commission
#> Abraham Lincoln   3.433987 2.397895 1.791759 2.564949 2.708050   2.079442
#> Andrew Jackson    4.248495 4.663439 2.995732 1.945910 3.828641   3.218876
#> Andrew Johnson    4.025352 3.091042 2.833213 3.332205 2.772589   2.079442
#> Barack Obama      1.386294 0.000000 0.000000 1.386294 0.000000   0.000000
#> Benjamin Harrison 4.060443 4.174387 2.302585 4.304065 3.663562   3.465736
#> Calvin Coolidge   3.713572 4.094345 1.386294 3.555348 2.639057   1.609438

Clustering

We begin by clustering this data set, grouping the rows (presidents) into clusters:

carp_fit <- CARP(presidential_speech)
#> Pre-computing weights and edge sets
#> Computing Convex Clustering [CARP] Path
#> Post-processing
print(carp_fit)
#> CARP Fit Summary
#> ====================
#> 
#> Algorithm: CARP (t = 1.05) 
#> Fit Time: 0.142 secs 
#> Total Time: 0.636 secs 
#> 
#> Number of Observations: 44 
#> Number of Variables:    75 
#> 
#> Pre-processing options:
#>  - Columnwise centering: TRUE 
#>  - Columnwise scaling:   FALSE 
#> 
#> Weights:
#>  - Source: Radial Basis Function Kernel Weights
#>  - Distance Metric: Euclidean
#>  - Scale parameter (phi): 0.01 [Data-Driven]
#>  - Sparsified: 4 Nearest Neighbors [Data-Driven]

The algorithmic regularization technique employed by CARP makes computation of the whole solution path almost immediate.

We can examine the result of CARP graphically. We begin with a standard dendrogram, with three clusters highlighted:

plot(carp_fit, type = "dendrogram", k = 3)

Examining the dendrogram, we see two clear clusters, consisting of pre-WWII and post-WWII presidents, and Warren G. Harding as a possible outlier. Harding is generally considered one of the worst US presidents of all time, so this is perhaps not too surprising.

A more interesting visualization is the dynamic path visualization, whereby we can watch the clusters fuse as the regularization level is increased:

plot(carp_fit, type = "path", dynamic = TRUE)

BiClustering

The use of CBASS for convex biclustering is similar, and we demonstrate it here with a cluster heatmap, with the regularization set to give 3 observation clusters:

cbass_fit <- CBASS(presidential_speech)
#> Pre-computing column weights and edge sets
#> Pre-computing row weights and edge sets
#> Computing Convex Bi-Clustering [CBASS] Path
#> Post-processing rows
#> Post-processing columns
plot(cbass_fit, k.row = 3)

By default, plotting the result of CBASS gives the traditional cluster heatmap, but we can also get the row or column dendrograms:

plot(cbass_fit, type = "row.dendrogram", k.row = 3)

By default, if a regularization level is specified, all plotting functions in clustRviz will plot the clustered data. If the regularization level is not specified, the raw data will be plotted instead:

plot(cbass_fit, type = "heatmap")

More details about the use and mathematical formulation of CARP and CBASS may be found in the package documentation.

clustrviz's People

michaelweylandt nrkarthikeyan mfaroukb bhmbhm coolalexzb kenzeng24 jasminezhuoy dansenglund huishenstats seeseamiao horizonailab huichenghui ncku-bioinformatic-club dipterix

clustrviz's Issues

Introductory Vignette

Vignettes giving introduction to convex clustering / bi-clustering + comparison to other methods. (Can adapt from the original papers on the subject)

Add Appveyor tests

Add appveyor tests

tSNE-Based Weights

Add "tSNE" and "preSNE"-type weights.

As originally discussed in #7, @tianyibourne and @minjiewang1991 have had some success with alternate weighting schemes.

Graphics Sizes in `README`

The graphics sizes in the README and home page are terrible.

Code to Create `presidential_speech` data set

Add code to (re)create presidential_speech data set. This will allow the data set to be re-run as more speeches are recorded. (If we update the "official" version in the package, we may need to version it somehow.)

CARP crash

Running into the following:

> carp_fit <- CARP(tf, labels = track_features$track_uri)
'X' should be a matrix, not a data.frame. Converting with as.matrix(). (Called from CARP)
Pre-computing weights and edge sets
Error in if (num_connected_old == num_connected) { : 
  missing value where TRUE/FALSE needed

Also, it's taking ~10 minutes just to get to this error on a 2600 by 11 matrix -- is that expected?

Can send data for a reprex if necessary.

Avoid Use of aes_string

ggplot2::aes_string is deprecated. See https://ggplot2.tidyverse.org/dev/articles/ggplot2-in-packages.html for better approaches.

Error in dynamic path visualization when following tutorial:

I was following the example from the quickstart and got the following error:

>plot(carp_fit, type = "dynamic_path")

nframes and fps adjusted to match transition
geom_path: Each group consists of only one observation. Do you need to adjust the group aesthetic?
geom_path: Each group consists of only one observation. Do you need to adjust the group aesthetic?
Error in magick::image_animate(anim, fps, loop = if (loop) 0 else 1) :                                                  
  argument 'fps' must be a factor of 100

[Edited by MW: Added triple-backticks for code formatting]

I don't know if this is an issue of my local configuration, but I thought I should mention it

Plot original features for CARP Paths

I'd like to be able to plot the CARP paths for the raw features as well as the first few principal components.

Hardcoded paths to strip and uname

In Makevars strip and uname are assumed to be in /usr/bin and /bin respectively. This is not the case on some systems (e.g., NixOS), requiring patching to remove the hardcoded paths.

Have GitHub Actions Ask Folks to Retarget PRs to Master

Is it possible to have GitHub Actions detect if a PR has been made to master and ask the author to move it to develop instead? See #98 for an example.

U Smoothing

Add a post-processing step to "smooth" U by replacing the U-hat for clustered elements with the mean U-hat for elements in that cluster. This won't change the implied clusterings, but it will improve some minor problems with the path graphics, where the centroids do not exactly coincide.

(Note that this isn't a real CARP / CBASS issue but rather a deeper issue with running the ADMM for a finite amount of time on these problems. It is similar to how interior point methods never produce exact zeros on the lasso, even when run to essentially numerical convergence, and need a final thresholding step.)

Once this is done, we can speed up plotting by only showing the distinct path elements instead of every path element. The plot_frame elements passed to ggplot can go through a dplyr::distinct first to speed things up / avoid massive over-plotting. This should alleviate some of the slowness issues at the heart of #56.

TODO:

Add CARP U smoothing
Add CBASS U smoothing
Update get_cluster_centroids and get_clustered_data
Speed up path plotting
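The smoothing step proposed in this issue can be sketched in plain C++ as follows. This is only an illustration under stated assumptions: smooth_u is a hypothetical name, U is stored as a vector of rows rather than as an Eigen matrix, and the cluster labels are assumed to be already known and numbered from 0 to k - 1. It is not the package's implementation.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of clustRviz): replace every row of U with the
// mean of all rows that share its cluster label.
// Assumptions: U is n x p, stored as a vector of rows; labels[i] is in [0, k).
std::vector<std::vector<double>> smooth_u(
    const std::vector<std::vector<double>>& U,
    const std::vector<std::size_t>& labels,
    std::size_t k) {
  const std::size_t n = U.size();
  const std::size_t p = n > 0 ? U[0].size() : 0;

  // Accumulate per-cluster column sums and cluster sizes.
  std::vector<std::vector<double>> sums(k, std::vector<double>(p, 0.0));
  std::vector<std::size_t> counts(k, 0);
  for (std::size_t i = 0; i < n; ++i) {
    counts[labels[i]] += 1;
    for (std::size_t j = 0; j < p; ++j) {
      sums[labels[i]][j] += U[i][j];
    }
  }

  // Every row in a cluster receives that cluster's mean row, so rows in the
  // same cluster coincide exactly afterwards.
  std::vector<std::vector<double>> smoothed(n, std::vector<double>(p, 0.0));
  for (std::size_t i = 0; i < n; ++i) {
    const std::size_t c = labels[i];
    for (std::size_t j = 0; j < p; ++j) {
      smoothed[i][j] = sums[c][j] / static_cast<double>(counts[c]);
    }
  }
  return smoothed;
}
```

Because rows within a cluster become identical after this step, a later de-duplication pass (such as the dplyr::distinct idea above) would collapse them to one path element per cluster.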

Visual Tests

We need real tests (not just "runs without error" tests) for graphics functions. It looks like the vdiffr package is the best way to do so. I'll work on adding some for the static graphics after I finish #43.

@jjn13 Any ideas on testing the results of the interactive / dynamic parts?

Add Logging

Add proper logging to the package

Remove `static` and `interactive` flags

The static and interactive flags to CARP and CBASS were originally added to speed up fitting for cases when plotting would not be needed. If we can make these calculations cheap (vs. the actual clustering), there's no reason to have these flags and we can simplify the code base significantly.

Now (2018-09-12), the remaining culprit appears to be the PCA-projection in CARP (used to visualize cluster paths).

Add Clustering & Bi-Clustering Performance Benchmarks

We should add some benchmarks indicating the advantages of convex (bi)clustering over classical methods. Depending on the run-time, these can be a vignette or just included as a "run for yourself" file.

@jjn13 I believe you had some of these from early experiments. Could you share? If you email them to me, I'll work them up into a vignette.

Handling Well-Separated Observations

When a data point is far from the bulk, e^(-d^2) can underflow to zero for all (potential) neighbors yielding an unconnected graph. While this is numerically a problem, it's not conceptually one so we should handle it better than just throwing an error.

@agenevera suggested adding a small epsilon nugget to the weights (possibly to all, or at least to all the zero weights) in the dense calculations. This would still be mostly sparsified away, but would fix connectedness issues.

(See prior discussion and reproducible example in #54)
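To make the underflow concrete, here is a small C++ sketch of an RBF-style weight with an additive nugget. The function name rbf_weight, the explicit phi scale parameter in the exponent, and the choice to add the nugget to every weight (rather than only to the zero weights) are assumptions made for illustration; the issue leaves that choice open.

```cpp
#include <cmath>

// Hypothetical helper (not part of clustRviz): RBF weight exp(-phi * d^2)
// plus a small additive nugget. With nugget = 0 this is the plain RBF weight,
// which underflows to exactly 0.0 in double precision once phi * d^2 is
// larger than roughly 745.
double rbf_weight(double dist, double phi, double nugget) {
  return std::exp(-phi * dist * dist) + nugget;
}
```

For example, with dist = 40 and phi = 1 the plain weight is exactly zero, while any positive nugget keeps the edge weight strictly positive, so the point stays connected before sparsification.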

Accelerated ADMMs for exact = TRUE

As shown in https://arxiv.org/abs/1901.06075, it can be useful, especially on larger data sets, to include an option for an accelerated ADMM. It might be worth exploring this for CARP and CBASS, when exact = TRUE.

AppVeyor improvements

Two medium-/long-term TODOs:

The README AppVeyor badge is showing failure even when the latest build is successful
Can we use AppVeyor to test PRs as well?

Improve C++ Hygiene

A user has reported compilation issues on Windows that appear to result from the lack of a system typedef for uint. (Windows provides typedef unsigned int UINT instead.) Possible fixes are being tried out in my mw/fix_uint branch.

This is a good opportunity to check general int and signed/unsigned hygiene. See also DataSlingers/ExclusiveLasso#12.
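As a rough illustration of the kind of hygiene meant here, the sketch below avoids the non-standard uint typedef by using std::size_t, which also matches the return type of size() and so avoids signed/unsigned comparison warnings. The function count_positive is a hypothetical example, not code from the package.

```cpp
#include <cstddef>
#include <vector>

// `uint` is not standard C++ and is only provided by some system headers.
// Using std::size_t for indices is portable and matches the type returned by
// std::vector::size(), so the loop comparison below is unsigned vs. unsigned.
std::size_t count_positive(const std::vector<int>& x) {
  std::size_t count = 0;
  for (std::size_t i = 0; i < x.size(); ++i) {
    if (x[i] > 0) {
      ++count;
    }
  }
  return count;
}
```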

clustering.CBASS returns values on the transformed data

I would expect this to be true:

cbass_fit <- CBASS(presidential_speech)
all(clustering(cbass_fit, percent = 0)$cluster.mean.matrix == t(presidential_speech))
#> [1] FALSE

(Note that the t() shouldn't be there either as noted in #9)

Clarify Installation and Development Docs

Is there any plan to submit this package to CRAN?

Confirm GitHub Actions builds, but doesn't deploy, on PRs

See incorrect push on #98 and lack of testing on #100

Related to #99

Increase Number of Frames in Plots

By default, gif export and shiny app plot a relatively small number of frames (21 I think), resulting in a pretty choppy "movie experience" -- let's increase this.

Path plots for `CBASS`

I'd like to be able to create CARP-style path plots for both the observation and feature clustering performed by CBASS.

Add L_{\infty} Fusion Penalty

Currently we support $\ell_q$ fusion penalties for $q = 1, 2$ only. It should be straightforward to add support for the $q = \infty$ case (i.e., implement the $\ell_{\infty}$ prox) using the projection method of Duchi et al. (https://stanford.edu/~jduchi/projects/DuchiShSiCh08.html).
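One possible route, sketched below in plain C++, uses the Moreau decomposition: since the $\ell_1$ norm is dual to the $\ell_{\infty}$ norm, the prox of $\lambda \|\cdot\|_{\infty}$ at $v$ equals $v$ minus the Euclidean projection of $v$ onto the $\ell_1$ ball of radius $\lambda$. The projection uses the sort-based variant of the Duchi et al. method. The names project_l1_ball and prox_linf are hypothetical, the code works on std::vector rather than Eigen types, and it is a sketch rather than a drop-in for the package.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

// Euclidean projection of v onto the l1 ball {w : ||w||_1 <= radius},
// using the O(n log n) sort-based algorithm.
std::vector<double> project_l1_ball(const std::vector<double>& v, double radius) {
  if (radius <= 0.0) {
    return std::vector<double>(v.size(), 0.0);  // ball degenerates to {0}
  }
  double l1 = 0.0;
  for (double x : v) l1 += std::abs(x);
  if (l1 <= radius) return v;  // already inside the ball

  // Sort magnitudes in decreasing order.
  std::vector<double> u(v.size());
  for (std::size_t i = 0; i < v.size(); ++i) u[i] = std::abs(v[i]);
  std::sort(u.begin(), u.end(), std::greater<double>());

  // Threshold theta comes from the largest index whose magnitude still
  // exceeds the running candidate threshold.
  double cumsum = 0.0;
  double theta = 0.0;
  for (std::size_t j = 0; j < u.size(); ++j) {
    cumsum += u[j];
    const double candidate = (cumsum - radius) / static_cast<double>(j + 1);
    if (u[j] - candidate > 0.0) theta = candidate;
  }

  // Soft-threshold every entry by theta, keeping the original signs.
  std::vector<double> w(v.size());
  for (std::size_t i = 0; i < v.size(); ++i) {
    w[i] = std::copysign(std::max(std::abs(v[i]) - theta, 0.0), v[i]);
  }
  return w;
}

// prox of lambda * ||.||_inf via the Moreau decomposition.
std::vector<double> prox_linf(const std::vector<double>& v, double lambda) {
  const std::vector<double> proj = project_l1_ball(v, lambda);
  std::vector<double> out(v.size());
  for (std::size_t i = 0; i < v.size(); ++i) out[i] = v[i] - proj[i];
  return out;
}
```

As a sanity check, prox_linf applied to (3, 1) with lambda = 1 gives (2, 1): only the largest entry is pulled down, and the total reduction equals lambda.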

Progress Updates

Add dynamic output to assure users that progress is being made. I don't know how to do an "honest" progress bar for this problem, so something of the form

[SPINNER] Number of Edges Fused: E / E_max. Percent Variance Remaining: V_norm / V_norm_0. Iterations Performed: I. Current Gamma: G.

(Perhaps not all of that if it's too long - probably clip to 80 characters)

Add code to recreate clustRviz paper figures

Polish up the code used to make figures in https://arxiv.org/abs/1901.01477 and put it somewhere in the package / on the wiki for this repo.

Fix Algorithm String in `print.CBASS`

Following #112, the algorithm name printed by print.CBASS after CBASS(exact = TRUE) is incorrect. (Should be "generalized ADMM".)

Fix CI Segfault in Rendering CARP Plots on Ubuntu

The Ubuntu test-bed on GitHub Actions trips a segfault when trying to render CARP plots under R 3.6.3. This doesn't seem to be a problem in our code, but it makes our CI history much "dirtier" than it should be and slows down PR reviews.

Skip RNG Set Up

By default, Rcpp transfers the RNG state to and from R when calling into C++. This is a little bit expensive and not necessary for us since we don't use RNGs in C++.

If we change our C++ attributes to // [[Rcpp::export(rng = false)]], things will be a smidge faster. (I'd imagine this is still dwarfed by compute time, but it can't hurt.)

See example at https://github.com/tidyverse/dplyr/blob/e08f0511986a10efda5b9fd991af2e049d801333/src/address.cpp#L18

Update Builds

It seems GH can only serve GH pages from master branch so rename branches:

stable --> master
master --> develop

Also need to update TravisCI scripts

Disable Back-Tracking by Default

@agenevera thinks that the backtracking versions (CARP-VIZ and CBASS-VIZ) are too expensive to use by default on larger data sets.

@jjn13 Thoughts on changing the default to disable back-tracking?

"Tidy" Accessors

Add as.data.frame and similar accessors for programmatic usage of clustering solutions.

Change `CBASS` `obs` and `var` to `row` and `col` consistently

@agenevera wants us to change "obs" and "var" to "row" and "col" throughout since the notion of "observation" and "feature" might not apply wherever bi-clustering is used. This is not a hard change to make, but we should do it after we decide whether to use $X$ or $X^T$ internally throughout, since working with $X^T$ will make it very hard to keep "row" and "col" clear in the code.

Let's wait until the technical documentation (#28) is complete before deciding how to proceed.

Better Handling when `max_iter` is hit

When max_iter is hit, we throw an error in post-processing. This should be fixed...

clustering.CBASS returns a transposed data matrix

clustering.CBASS returns a transposed data matrix, contrary to the documentation:

cbass_fit <- CBASS(presidential_speech)
dim(clustering(cbass_fit, percent = 0.5))
#> [1] 75 44

dim(presidential_speech)
#> [1] 44 75

Add Code Coverage Reporting

Add codecov.io support (using covr package)

Tighten alpha bound for biclustering

The current alpha bound works in practice, but it is not as tight as theoretically possible. However, tighter bounds seemed to result in periodic failure to converge. We need to determine the right alpha to get performance guarantees.

Increase Frequency of Interrupt Checks

For some long running operations, I was unable to stop the calculations once they started. Might want to check that Rcpp::checkUserInterrupt() gets called semi-frequently in the main looping operations as recommended in the Best Practices section of R Packages.

Cache Shiny Plots

Our shiny app is often pretty slow to render plots, resulting in the slider and the plot getting out of sync. This seems like it would help: https://blog.rstudio.com/2018/11/13/shiny-1-2-0

Technical Documentation

Add technical vignette which describes the vectorized formulation of CARP / CBASS in breathtaking detail.

Code Clean-Ups

Various things to clean-up after #112:

Various signed / unsigned comparison issues flagged by compiler. (Addressed by #122)
User configurable stopping parameters for exact = TRUE mode. Specifically, the convergence threshold and maximum iterations should be settable via clustRviz_options. (Addressed by #122)
reset_aux is now malignant and should be removed to increase performance of exact = TRUE. (Addressed by #117)
Calls to full_admm_step are now redundant and can be elided. (Addressed by #117)
The gamma values used in exact = TRUE and exact = FALSE don't line up. (See script in comments on #112)
Stopping criteria for clustering_impl.h and biclustering_impl.h don't line up. (Addressed by #117 and #122)
We should split MatrixProx into MatrixRowProx (current) and MatrixColProx to avoid unnecessary transposes: (Addressed by #117)

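// Column-wise proximal operator with per-column weights:
//   l1 = true  : soft-threshold each entry of column j by lambda * weights(j)
//   l1 = false : shrink column j as a group (l2 / group soft-thresholding)
Eigen::MatrixXd MatrixColProx(const Eigen::MatrixXd& X,
                              double lambda,
                              const Eigen::VectorXd& weights,
                              bool l1 = true){
  Eigen::Index n = X.rows();
  Eigen::Index p = X.cols();

  Eigen::MatrixXd V(n, p);

  if(l1){
    // Element-wise soft-thresholding
    for(Eigen::Index i = 0; i < n; i++){
      for(Eigen::Index j = 0; j < p; j++){
        V(i, j) = soft_thresh(X(i, j), lambda * weights(j));
      }
    }
  } else {
    for(Eigen::Index j = 0; j < p; j++){
      Eigen::VectorXd X_j = X.col(j);
      // Shrinkage factor for the whole column
      double scale_factor = 1 - lambda * weights(j) / X_j.norm();

      if(scale_factor > 0){
        V.col(j) = X_j * scale_factor;
      } else {
        // Column norm is at or below the threshold: shrink to zero
        V.col(j).setZero();
      }
    }
  }

  return V;
}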

@dansenglund should be added to the DESCRIPTION file as an author. (Addressed by #117)

Will add more as I think of them.
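The MatrixColProx snippet above calls soft_thresh, which is not shown in this issue. A minimal sketch, assuming it is the standard scalar soft-thresholding operator sign(x) * max(|x| - lambda, 0), could look like this; the actual definition in the package may differ.

```cpp
#include <algorithm>
#include <cmath>

// Assumed definition (not taken from the package source): scalar
// soft-thresholding, i.e. shrink x toward zero by lambda and clamp at zero.
double soft_thresh(double x, double lambda) {
  return std::copysign(std::max(std::abs(x) - lambda, 0.0), x);
}
```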

Fails with Uniform Weights

This fails somewhere in the dendrogram reconstruction code:

CARP(presidential_speech, weights = rep(1, 44 * 43 / 2))

I've confirmed that this fails with a commit from before I touched the C++ code, so I don't think it's anything I did.

@jjn13 - Can you take a look? It might be in the back-tracking: alg.type="carp" seems to fix it.

Combine `carp.cpp` and `carp_viz.cpp`

carp.cpp and carp_viz.cpp are >90% the same, so they can probably be reasonably consolidated. (Same for cbass.cpp and cbass_viz.cpp) This will cut down compile time and make future maintenance easier.

No visible plots for interactive CARP plots

I tried to play with a carp fit on the first 100 points from the data in #54, but the Shiny app wasn't displaying any graphics, just the regularization slider.

Plain Non-Pathwise Solver

It would be good to expose a plain solver for convex (bi-)clustering at a single lambda value. Useful for package comparisons / timing stuff as well as refining solutions in graphics routines.

New gganimate api

I attempted to install the most recent version of gganimate (https://github.com/thomasp85/gganimate/) on one of my linux machines:

devtools::install_github('thomasp85/gganimate')

but this fails due to gifski dependency (https://github.com/r-rust/gifski). Trying to install this via:

devtools::install_github("r-rust/gifski")

asks me to install a Rust environment via:

sudo apt-get install cargo

which also fails, but can be fixed by adding an external repo.

This is not necessarily a no-go, but we will be requiring Windows/Linux/Mac users to have Rust; the situation then seems similar to the SuperLU difficulties we encountered w/ RcppArmadillo.

Just wanted to check if this is desirable before translating gganimate to the new API.

Investigate Exact ADMM for Bi-Clustering

Test and Implement the new CBASS based on https://arxiv.org/abs/1901.06075

Formula interface

Please enable my laziness.

When a dataset has mixed data types, it can be a pain to pass an explicit matrix and then rejoin on that matrix down the line.

Proposals:

# assuming there's some nice column called labels

CARP(labels ~ Sepal.Width + Sepal.Length, data = iris)  

# or
CARP(formula = ~ Sepal.Length + Sepal.Width, data = iris, labels = ~ labels)