utrechtuniversity / scca

Spectral Clustering Correspondence Analysis

License: Other

R 100.00%

correspondence-analysis spectral-clustering utrecht-university

scca's Introduction

SCCA: Spectral Clustering Correspondence Analysis in R

Introduction

The SCCA package implements in R the methodological approach to correspondence analysis (CA) proposed in "Correspondence analysis, spectral clustering and graph embedding: applications to ecology and economic complexity" (van Dam et al., 2021).

Installation

The package can be installed directly from GitHub with the code below. Make sure the devtools package is installed first.

#install.packages("devtools")
library(devtools)
install_github("UtrechtUniversity/scca", build_vignettes = TRUE)

Documentation of exported functions and data set

After loading the package a list of all exported functions and data sets can be retrieved by ?SCCA and the documentation of an individual function by ?<function name>; e.g. ?scca_compute.

The methodology and the use of the functions and the data are explained in the included vignette. After installing package SCCA use browseVignettes('SCCA') in the R(Studio) console.

License

The software code is licensed under the MIT license. The next section (References) links to the sources of the included data sets; see those sources for the licenses of the data.

References

Software

van Dam, Alje, Dekker, Mark, Morales-Castilla, Ignacio, Rodríguez, Miguel Á., Wichmann, David and Baudena, Mara (2021); Correspondence analysis, spectral clustering and graph embedding: applications to ecology and economic complexity; Scientific Reports; DOI: 10.1038/s41598-021-87971-9

Included data set

Faurby, Søren et al.; 2019; PHYLACINE 1.2: The Phylogenetic Atlas of Mammal Macroecology

The team

The team members are:

Mathematical foundations of the code
- Alje van Dam, Copernicus Institute of Sustainable Development and Centre for Complex Systems Studies, Utrecht University, the Netherlands
- Mark Dekker, Department of Information and Computing Sciences and Centre for Complex Systems Studies, Utrecht University, the Netherlands
Programming and packaging
- Kees van Eijden, Research Engineering/ITS, Utrecht University, the Netherlands
With contributions of
- Ignacio Morales Castilla, Global Change Ecology and Evolution Group, Department of Life Sciences, University of Alcalá, Spain
- Jonathan de Bruin, Research Engineering/ITS, Utrecht University, the Netherlands
- Raoul Schram, Research Engineering/ITS, Utrecht University, the Netherlands
- Mara Baudena, National Research Council of Italy, Institute of Atmospheric Science and Climate (CNR-ISAC), Turin, Italy; Copernicus Institute of Sustainable Development and Centre for Complex Systems Studies, Utrecht University, the Netherlands

How to cite SCCA

To cite the SCCA repository and R package, use citation("SCCA") to retrieve the BibTeX entry. Otherwise use the following format:

van Eijden, Kees et al; 2021; SCCA: Spectral Clustering Correspondence Analysis in R; Utrecht University; DOI: 10.5281/zenodo.4665670. Also available at Utrecht University.

Please also cite the paper van Dam et al, 2021 when using the SCCA repository.


scca's Issues

Flat file export of tree

We would like to export a flat file of the tree.

CSV files are in SurfDrive.

See python code (last function in the class).

stability

drop <- sample(ncol(carnivora), ncol(carnivora) %/% 10)
stability <- scca_stability_test(m = carnivora, drop_vars = drop)
Error in clustering_overlap(cl.x, cl.y, plot = plot) :
Clusterings x and y not from the same dataset and category.

Add skeleton of the function(s) required for analysis

Typo in vignette

@KvEijden

Hi Kees, I'm leaving an issue here regarding the vignette. The part that shows how to load the package needs correcting. It currently reads:

library(sccar)

and should show instead:

library(SCCA)

Let me know if you want me to correct it.

Error decomp = svds

scca <- scca_compute(carnivora, decomp = 'svds')
Error in fun(A, k, nu, nv, opts, mattype = "dgCMatrix") :
nrow(A) and ncol(A) should be at least 3
In addition: Warning messages:
1: In fun(A, k, nu, nv, opts, mattype = "dgCMatrix") :
all singular values are requested, svd() is used instead
2: In fun(A, k, nu, nv, opts, mattype = "dgCMatrix") :
all singular values are requested, svd() is used instead

Disconnected columns and rows

Sometimes in the clustering process there will be sub matrices with disconnected columns and/or disconnected rows. How to handle these cases?

If disconnections occur, the function will print a warning to the console.
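A check of this kind could be sketched as follows. The sketch assumes that "disconnected" means a row or column whose entries are all zero; the helper name find_disconnected is hypothetical and not part of the package.

```r
# Hypothetical helper, not part of SCCA: report all-zero rows and columns
# of a (sub)matrix, assuming that is what "disconnected" means here.
find_disconnected <- function(m) {
  stopifnot(is.matrix(m), is.numeric(m))
  list(
    rows = unname(which(rowSums(m != 0) == 0)),
    cols = unname(which(colSums(m != 0) == 0))
  )
}

# Toy sub-matrix: row 2 and column 2 contain only zeros.
m <- matrix(c(1, 0, 0,
              0, 0, 0,
              1, 0, 1), nrow = 3, byrow = TRUE)
res <- find_disconnected(m)
if (length(res$rows) > 0 || length(res$cols) > 0) {
  message("sub-matrix contains disconnected rows and/or columns")
}
```

Whether such rows and columns should be dropped, assigned to a cluster of their own, or treated as an error is left open here.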

iterations

Sometimes kmeans gives a warning: 'did not converge in 10 iterations'. How should the functions in the package handle this situation?
Should the iter.max parameter of kmeans be exposed as a parameter of scca_compute?
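One possible way to handle this is sketched below: pass the iteration limit through to stats::kmeans via its iter.max argument and retry with a larger limit when the convergence warning appears. The wrapper name kmeans_retry and its arguments are hypothetical; this is not how the package currently behaves.

```r
# Hypothetical wrapper, not part of SCCA: run kmeans and, if it warns about
# non-convergence, retry with a doubled iteration limit.
# Note: the warning is matched on its English text, so this assumes an
# English locale for R messages.
kmeans_retry <- function(x, centers, iter_max = 10L, max_tries = 5L) {
  for (i in seq_len(max_tries)) {
    converged <- TRUE
    fit <- withCallingHandlers(
      stats::kmeans(x, centers = centers, iter.max = iter_max),
      warning = function(w) {
        if (grepl("did not converge", conditionMessage(w))) {
          converged <<- FALSE
          invokeRestart("muffleWarning")
        }
      }
    )
    if (converged || i == max_tries) break
    iter_max <- iter_max * 2L
  }
  list(fit = fit, converged = converged, iter_max = iter_max)
}

# Toy data: two well-separated groups of 50 points each.
set.seed(1)
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 5), ncol = 2))
res <- kmeans_retry(x, centers = 2)
```

Returning the converged flag alongside the fit lets the caller decide whether a non-converged clustering is acceptable instead of silently using it.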

Trade flow dataset name

@aljevandam What do you prefer for the trade flow dataset?

unilaterally_trade?

Publish vignettes on gh-pages

There is a simple trick to publish the vignette on gh-pages (https://hafen.github.io/packagedocs/#more_on_vignettes). This might be our best solution for hosting the vignette. Shall we give this a try? We need to set up Travis or GH Actions, but this is straightforward.

@qubixes You do have experience with gh-pages and Travis, interested in contributing on this?

Coordinates of carnivora

Should we include the coordinates of carnivora? Is this part of the analysis?

Proposal:

If required, we ship two datasets (carnivora and carnivora_coords) to replace the list structure (complicated for non-R programmers).
If not required, we remove the coordinates and carnivora returns the matrix directly.

@KvEijden can you give feedback on this?

Raw datasets not available

The raw datasets are not available in the package. Therefore, it is not possible to reproduce the packaging. Can we access the datasets from the web?

normalization eigenvectors when computed using 'trick'

In scca_compute, the eigenvectors can be computed using the 'trick', i.e. first decomposing the 'small matrix' to obtain the matrix of vectors U and then obtaining the eigenvectors by pre-multiplying with D^-1 A, so that D^-1 A U = V. This gives the correct eigenvectors, but they are not normalized properly. They should be normalized by setting v = v / sqrt(v^T D v), where D is the appropriate diagonal matrix with either the row or the column sums on the diagonal.
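The normalization step can be sketched in a few lines of R. The helper name normalize_by_degree is hypothetical, and d stands for the diagonal of D (row or column sums); the sketch only rescales each column to unit D-norm and assumes the columns are already eigenvectors.

```r
# Hypothetical helper, not part of SCCA: rescale each column v of V so that
# t(v) %*% D %*% v = 1, where D = diag(d).
normalize_by_degree <- function(V, d) {
  stopifnot(is.matrix(V), length(d) == nrow(V), all(d > 0))
  norms <- sqrt(colSums(d * V^2))  # sqrt(v^T D v) for every column
  sweep(V, 2, norms, "/")
}

# Toy example with arbitrary vectors and weights.
set.seed(42)
V <- matrix(rnorm(12), nrow = 4)
d <- c(2, 1, 3, 4)
Vn <- normalize_by_degree(V, d)
check <- colSums(d * Vn^2)  # should be 1 for every column
```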

Add tests with testThat

Technicalities to be dealt with considering eigenvalue computation and normalization (mostly for own reference, to be discussed)

Normalization of eigenvectors: The normalization of the eigenvectors is currently done within the create_y function. It would be more efficient to normalize them directly when they are calculated, so that the eigenvectors are also normalized in other outputs. Perhaps even return the 'standardized' vectors (i.e. the normalized eigenvectors multiplied by the sqrt of their eigenvalue).
Computation of eigenvectors: This also concerns the question of the symmetry of the Laplacian mentioned in the function. Looking at the rARPACK documentation, it might be computationally more efficient to compute the eigenvectors from a symmetric matrix instead of the Laplacian, and then post-process them to obtain the eigenvectors of L, as follows:

D^{-1/2}SD^{-1/2} = U \Lambda U^T, where U^T U = I
so that
L = D^{-1} S = D^{-1/2} U \Lambda U^T D^{1/2}
so the right eigenvectors are given by
V = D^{-1/2}U and then normalized setting v = v / sqrt(v^T D v) so that V^T D V = I (as required). Normalization is unnecessary if rARPACK returns normalized eigenvectors (to be checked)
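The derivation above can be checked numerically on a toy example. The sketch below uses base R's eigen() as a stand-in for rARPACK, and the similarity matrix S is an arbitrary symmetric matrix built for illustration, not the one constructed inside the package.

```r
# Sketch with base R's eigen() standing in for rARPACK.
set.seed(1)
A <- matrix(rpois(30, lambda = 2) + 1, nrow = 5)  # small positive 5 x 6 matrix
S <- A %*% t(A)                  # symmetric similarity matrix (assumed form)
d <- rowSums(S)                  # diagonal of D

M <- S / sqrt(outer(d, d))       # D^{-1/2} S D^{-1/2}
e <- eigen(M, symmetric = TRUE)  # M = U Lambda U^T with U^T U = I
V <- e$vectors / sqrt(d)         # V = D^{-1/2} U (row i scaled by 1/sqrt(d_i))

L <- S / d                       # L = D^{-1} S (row i scaled by 1/d_i)
resid_eig  <- max(abs(L %*% V - V %*% diag(e$values)))        # L V = V Lambda
resid_norm <- max(abs(crossprod(V, d * V) - diag(length(d)))) # V^T D V = I
```

In this example both residuals are at the level of floating-point error, so no extra normalization is needed when the symmetric solver returns orthonormal U. Whether rARPACK does so remains the point to be checked.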

Calculating the embedding space: the function compute_y then only needs to create a matrix whose columns consist of eigenvectors 1 to k (dismissing the zeroth (trivial) eigenvector), and possibly weigh the columns by the sqrt of their eigenvalues (depending on whether we scale them in 1 or not).
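A minimal sketch of this step is given below. The helper name make_embedding and its arguments are hypothetical; it assumes the eigenvectors are stored as columns sorted by decreasing eigenvalue, with the trivial one first, and that the eigenvalues used for scaling are non-negative.

```r
# Hypothetical helper, not part of SCCA: build the k-dimensional embedding
# from eigenvectors sorted by decreasing eigenvalue (column 1 is trivial).
make_embedding <- function(V, values, k, scale = TRUE) {
  stopifnot(is.matrix(V), ncol(V) == length(values), k >= 1, k + 1 <= ncol(V))
  idx <- 2:(k + 1)                            # skip the trivial eigenvector
  Y <- V[, idx, drop = FALSE]
  if (scale) {
    Y <- sweep(Y, 2, sqrt(values[idx]), "*")  # weigh columns by sqrt(eigenvalue)
  }
  Y
}

# Toy input: three eigenvectors for four observations.
V <- cbind(rep(0.5, 4), c(1, -1, 1, -1), c(1, 1, -1, -1))
values <- c(1, 0.64, 0.25)
Y <- make_embedding(V, values, k = 2)
```

With scale = FALSE the function returns the selected eigenvectors unchanged, which matches the case where the scaling has already been applied when the eigenvectors were computed.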

Error when plotting overlap

Error: Can't convert to .
Run rlang::last_error() to see where the error occurred.
15. stop(fallback)
14. signal_abort(cnd)
13. abort(message, class = c(class, "vctrs_error"), ...)
12. stop_vctrs(message, class = c(class, "vctrs_error_incompatible"),
    x = x, y = y, details = details, ...)
11. stop_incompatible(x, y, x_arg = x_arg, y_arg = y_arg, details = details,
    ..., message = message, class = c(class, "vctrs_error_incompatible_type"))
10. stop_incompatible_type(x = x, y = to, ..., x_arg = x_arg, y_arg = to_arg,
    action = "convert", details = details, message = message,
    class = class)
9. stop_incompatible_cast(x, to, x_arg = x_arg, to_arg = to_arg,
    vctrs:::from_dispatch = match_from_dispatch(...))
8. vec_default_cast(x = x, to = to, x_arg = x_arg, to_arg = to_arg,
    vctrs:::from_dispatch = vctrs:::from_dispatch, vctrs:::df_fallback = vctrs:::df_fallback,
    vctrs:::s3_fallback = vctrs:::s3_fallback)
7. (function ()
    vec_default_cast(x = x, to = to, x_arg = x_arg, to_arg = to_arg,
    vctrs:::from_dispatch = vctrs:::from_dispatch, vctrs:::df_fallback = vctrs:::df_fallback,
    vctrs:::s3_fallback = vctrs:::s3_fallback))()
6. vec_cast(fill, val)
5. pivot_wider_spec(data, spec, !!id_cols, names_repair = names_repair,
    values_fill = values_fill, values_fn = values_fn)
4. tidyr::pivot_wider(data = overlap_xy, names_from = .data$cluster.y,
    names_prefix = "y_", values_from = .data$edge, values_fill = list(edge = 0)) at scca_overlap_test.R#187
3. plot_overlap(cl.xy) at scca_overlap_test.R#157
2. clustering_overlap(cl.x, cl.y, plot = plot) at scca_overlap_test.R#116
1. scca_overlap_test(x = scca, y = scca1, plot = TRUE)

List of required narrative

By Alje:

Non-technical project description (5-10 lines) (README, Tutorial)
Description of the software/package (+-5 lines) (readme, DESCRIPTION, ...)
Single line description of the software (DESCRIPTION)
Description, format and references of the Carnivores dataset. See iris.
Information about the researchers/faculties/CCSS

Accept matrices without colnames and rownames in the function scca
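One way to support unlabelled input is to generate default labels before the analysis starts. The sketch below is only an illustration; the helper name ensure_dimnames and the label prefixes are hypothetical and not part of the package.

```r
# Hypothetical helper, not part of SCCA: give a matrix default row and
# column labels when they are missing, leaving existing labels untouched.
ensure_dimnames <- function(m, row_prefix = "r", col_prefix = "c") {
  stopifnot(is.matrix(m))
  if (is.null(rownames(m))) rownames(m) <- paste0(row_prefix, seq_len(nrow(m)))
  if (is.null(colnames(m))) colnames(m) <- paste0(col_prefix, seq_len(ncol(m)))
  m
}

# Toy input without any labels.
m <- matrix(1:6, nrow = 2)
labelled <- ensure_dimnames(m)
```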

License

Under what license conditions will we publish our package?

Validity

Warning messages:
1: In max(distance[c1, c1]) : no non-missing arguments to max; returning -Inf
2: In min(distance[c1, c2]) : no non-missing arguments to min; returning Inf

in: validity <- scca_validity_test(scca = scca_species, dist = d_species)

Problems with exports dataset because there are no rownames

Hi, I succeeded in installing the package. I thought I'd just list here what I ran into:

I can load the data 'carnivora' and 'exports'.
The exports dataset seems to contain only the labels (products in entry 1 and countries in entry 2).
Upon running scca_compute on the exports dataset I get back the labels.
Upon running scca_compute on the carnivora dataset I get the error message 'M must contain row and column labels'.

Let me know if this helps. I might be looking at an older version?
