fsavje / scclust-R
Size-constrained Clustering in R
License: GNU General Public License v3.0
I can't seem to see this implemented here or in any R pkg for clustering. Do you think it's possible?
Hello there,
I'm using the latest versions of scclust (0.2.2) and distances (0.1.8) to construct many size-constrained clusters.
As I want to get the smallest possible clusters while respecting the minimum size constraint, I first use sc_clustering and then hierarchical_clustering, which works flawlessly for most of the groups.
However, I noticed that, on certain groups, calling hierarchical_clustering with an existing_clustering argument crashes my R session.
The attached issue_data.csv contains data that produces a crash with the following code snippet:
library(distances)
library(scclust)
X <- as.matrix(read.csv2("issue_data.csv"))  # no pipe, so magrittr is not needed
X_distances <- distances(X)
clustering <- sc_clustering(X_distances, size_constraint=10, seed_method="inwards_updating")
# crashes happen with the following line
h_clustering <- hierarchical_clustering(X_distances, size_constraint=10, existing_clustering=clustering)
The problem doesn't seem to be related to memory, since the snippet above works on much larger groups.
Moreover, the final line works when no existing_clustering is given.
I tried running hierarchical_clustering independently on each cluster in clustering and met no issue.
Thank you for your support.
Hi Fredrik!
Is there a way to set the exact number of clusters desired, on top of the cluster minimum size? Maybe in a second agglomerative step?
The problem is that I am only interested in a few clusters, typically 3-10. In the example, I tried setting size_constraint=50000/10, but sc_clustering seems to be very slow at creating clusters with a large minimum size. Try:
example(sc_clustering)
my_clustering <- sc_clustering(my_dist, size_constraint=50000/10)
Thanks!
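The "second agglomerative step" idea can be sketched in base R. This is only an illustration under stated assumptions, not a scclust feature: merge_to_k is a hypothetical helper, the initial labels are a stand-in for the output of sc_clustering, and merging is done by average-linkage clustering of the cluster centroids. Because clusters are only ever merged, every final cluster is at least as large as the smallest initial cluster.

```r
# Hypothetical helper (not part of scclust): merge an existing clustering
# down to exactly k clusters by agglomerating the cluster centroids.
merge_to_k <- function(X, labels, k) {
  # Cluster sizes and centroids; both are ordered by sorted label value
  sizes <- as.vector(table(labels))
  centroids <- rowsum(X, labels) / sizes
  # Agglomerate the centroids and cut the tree at k groups
  tree <- hclust(dist(centroids), method = "average")
  groups <- cutree(tree, k = k)
  # Map each data point's original label to its merged group
  unname(groups[as.character(labels)])
}

set.seed(1)
X <- matrix(rnorm(400), ncol = 2)

# Stand-in for a size-constrained clustering: 20 clusters of exactly 10
# points each, formed by slicing the data along the first coordinate.
labels <- integer(nrow(X))
labels[order(X[, 1])] <- rep(0:19, each = 10)

merged <- merge_to_k(X, labels, k = 5)
table(merged)
```

The merged clusters can be far from balanced, and nothing here optimizes the objective that scclust uses, so treat this as a starting point only.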
Dear Fredrik,
Thanks for the nice package. I noted that the base 'sc_clustering' function requires a 'distances' object. However, I am comparing lands based on environmental data and compute a multivariate ecological distance between each pair of lands. The output looks like this (fake example):
data.frame(land1=c("A","A","A","B","B","B","C","C","C"), land2=c("B","C","D","B","C","D","B","C","D"), ecodist=c(0.2,0.3,0.4,0.2,0.1,0.6,0.6,0.5,0.1))
From that I can re-create a distance matrix where ecodist serves as the distance. The format is essentially the same as a distances object, so it could in principle be passed to the sc_clustering function, couldn't it?
It would be very useful to have more flexibility to pass alternative distance matrix formats.
Many thanks!
Ervan
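For reference, re-creating a distance matrix from long-format pairs can be sketched in base R as follows. The pair values are illustrative, not taken from the post, and the sketch assumes each unordered pair appears exactly once. Note that the result is a base R dist object; as far as I know, distances() builds its objects from data point coordinates, so this alone does not produce something sc_clustering accepts.

```r
# Long-format pairwise distances (illustrative values)
pairs <- data.frame(
  land1   = c("A", "A", "A", "B", "B", "C"),
  land2   = c("B", "C", "D", "C", "D", "D"),
  ecodist = c(0.2, 0.3, 0.4, 0.1, 0.6, 0.5),
  stringsAsFactors = FALSE
)

# Square matrix with one row and column per land, zero on the diagonal
lands <- sort(unique(c(pairs$land1, pairs$land2)))
D <- matrix(0, nrow = length(lands), ncol = length(lands),
            dimnames = list(lands, lands))

# Fill both triangles so the matrix is symmetric
D[cbind(pairs$land1, pairs$land2)] <- pairs$ecodist
D[cbind(pairs$land2, pairs$land1)] <- pairs$ecodist

# Convert to a base R dist object
eco_dist <- as.dist(D)
eco_dist
```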
Dear Fredrik,
I have tried this package with a simple problem of 50 data points. I have solved it with a genetic algorithm for two clusters, of which one has 26 points and the other 24. The global minimum has a sum of distances of about 168.6613. A local minimum has a sum of distances of about 185.4753.
Using sc_clustering, I need to fix the size constraint to 20 (my_clustering <- sc_clustering(my_dist, 20)). The solution I get corresponds to the local minimum, not the global one. Is there something in the configuration of sc_clustering that I have not set up correctly?
If you need it, I am attaching the dataset for your evaluation.
Regards,
Julio
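To compare two candidate solutions on the same footing, the objective can be computed directly from the data and the cluster labels. The sketch below assumes that "sum of distances" means the sum of Euclidean distances over all pairs of points sharing a cluster; the post does not say which objective was used, and within_cluster_distance_sum is a hypothetical helper name.

```r
# Hypothetical helper: sum of pairwise Euclidean distances within clusters
within_cluster_distance_sum <- function(X, labels) {
  D <- as.matrix(dist(X))
  # TRUE where two points share a cluster label
  same <- outer(labels, labels, "==")
  # Count each unordered pair once via the upper triangle
  sum(D[same & upper.tri(D)])
}

# Four points in the plane: (0,0), (3,4), (10,0), (10,1)
X <- matrix(c(0, 0,
              3, 4,
              10, 0,
              10, 1), ncol = 2, byrow = TRUE)

within_cluster_distance_sum(X, c(1, 1, 2, 2))  # 5 + 1 = 6
within_cluster_distance_sum(X, c(1, 2, 1, 2))  # 10 + sqrt(58)
```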
Hi
I am just trying to rerun the example, and get an error:
Error in check_clustering(clustering = my_clustering): `size_constraint` must be scalar.
Thanks!
library(scclust)
#> Loading required package: distances
my_data <- data.frame(id = 1:100000,
                      type = factor(rbinom(100000, 3, 0.3),
                                    labels = c("A", "B", "C", "D")),
                      x1 = rnorm(100000),
                      x2 = rnorm(100000),
                      x3 = rnorm(100000))
# Construct distance metric
my_dist <- distances(my_data,
                     id_variable = "id",
                     dist_variables = c("x1", "x2", "x3"))
# Make clustering with at least 3 data points in each cluster
my_clustering <- sc_clustering(my_dist, 3)
check_clustering(my_clustering)
#> Error in check_clustering(clustering = my_clustering): `size_constraint` must be scalar.
Created on 2023-05-22 with reprex v2.0.2
Hello,
I have a similarity matrix df0 such as
object1 object2 similarity
x1 y1 0.09
x2 y2 0.25
I can create a dist object with
nams <- with(df0, unique(c(as.character(object1), as.character(object2))))
df1 <- with(df0, structure(similarity, Size = length(nams), Labels = nams, Diag = FALSE, Upper = FALSE, method = "user", class = "dist"))
Then, I can use for example kmeans with this dist matrix. However, in order to use sc_clustering from scclust, I need a distances object. Do you know how I can create it, either directly from the similarity matrix or from the dist object?
Thanks in advance.
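One possible workaround, sketched here under assumptions rather than as a feature of either package: turn the similarities into dissimilarities, embed them in Euclidean coordinates with classical multidimensional scaling (cmdscale from base R), and build the distances object from those coordinates. The 1 - similarity conversion, the toy values, and the choice of k = 2 dimensions are all assumptions, and the embedding only approximates the original dissimilarities.

```r
# Toy symmetric similarity matrix with values in [0, 1] (illustrative)
objs <- c("x1", "x2", "y1", "y2")
S <- matrix(c(1.0, 0.9, 0.2, 0.1,
              0.9, 1.0, 0.3, 0.2,
              0.2, 0.3, 1.0, 0.8,
              0.1, 0.2, 0.8, 1.0),
            nrow = 4, dimnames = list(objs, objs))

# Assumed conversion from similarity to dissimilarity
d <- as.dist(1 - S)

# Classical MDS: coordinates whose Euclidean distances approximate d
coords <- cmdscale(d, k = 2)

# Check how well the embedding reproduces the dissimilarities
fit <- cor(as.vector(d), as.vector(dist(coords)))

# Build a distances object from the coordinates if the package is installed
if (requireNamespace("distances", quietly = TRUE)) {
  my_dist <- distances::distances(coords)
}
```

If the fit is poor, increasing k may help, at the cost of a higher-dimensional embedding.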
I'm just trying to install the package after cloning it from GitHub:
R CMD INSTALL .
I tried explicitly setting -std=c99 like this:
# ~/.R/Makevars
CC=gcc -std=c99
Still, I got:
src/digraph_core.c: In function ‘iscc_digraph_is_valid’:
src/digraph_core.c:62:2: error: ‘for’ loop initial declarations are only allowed in C99 or C11 mode
for (size_t i = 0; i < dg->vertices; ++i) {
^
src/digraph_core.c:62:2: note: use option -std=c99, -std=gnu99, -std=c11 or -std=gnu11 to compile your code
The problem is in src/libscclust/Makefile:
scclust-R/src/libscclust/Makefile, lines 33 to 34 in 796f459
I changed it to be this:
%.o: %.c
	$(CC) -std=c99 -c $(ALL_CPPFLAGS) $(ALL_CFLAGS) $(XTRA_FLAGS) $< -o $@
And then the build worked without errors.
So, the Makefile needs to be modified so it can "see" the configuration variables inside ~/.R/Makevars.
Unfortunately, I can't find any good documentation for how to do this the right way...
The R-exts documentation is difficult to read, but maybe it has the answer.