kaneplusplus / biganalytics
License: Other
Dear all,
I've been using bigkmeans, but the initial implementation sometimes has trouble finding unique centers. This happens because it randomly samples k elements and checks whether they are all unique; if not, it tries again (up to nchecks times). This approach works fine most of the time, but if k is high and the dataset has a lot of similar elements, it is very difficult to find a set composed only of unique elements.
One way around this is to first get the list of unique elements and then sample k of them. I've changed the getcenters function to do this, e.g.:
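getcenters <- function(x, k, nstart) {
  # Keep only unique rows so every sampled set of centers is duplicate-free.
  # Note: as.matrix() pulls the whole data set into memory.
  # drop=FALSE keeps x a matrix even if only one unique row remains.
  x <- x[!duplicated(as.matrix(x)), , drop=FALSE]
  n <- nrow(x)
  if (n < k) {
    stop("not enough unique centers.\n")
  }
  # First set of k distinct rows, then one more per additional start.
  centers <- list(x[sample.int(n, k), , drop=FALSE])
  if (nstart > 1) {
    for (i in 2:nstart) {
      centers[[length(centers)+1]] <- x[sample.int(n, k), , drop=FALSE]
    }
  }
  return(centers)
}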
Cheers,
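To make the difference concrete, here is a minimal sketch in base R on an ordinary matrix (not a big.matrix). The toy data, the variable names, and the choice of 100 tries are all assumptions made for illustration; it is not the package's actual code. It compares the "sample and retry" strategy with the "deduplicate first" strategy on data with very few distinct rows.

```r
set.seed(1)

# Hypothetical toy data: 1000 rows, but only 5 distinct ones.
x <- matrix(rep(1:5, each = 200), ncol = 1)
x <- cbind(x, x * 10)
k <- 5

# Retry strategy: draw k rows and accept the draw only if all rows differ.
tries <- 100
hits <- sum(replicate(tries, {
  idx <- sample.int(nrow(x), k)
  !any(duplicated(x[idx, , drop = FALSE]))
}))

# Deduplicate-first strategy: sample k rows from the unique rows only.
ux <- x[!duplicated(x), , drop = FALSE]
centers <- ux[sample.int(nrow(ux), k), , drop = FALSE]

cat("successful random draws:", hits, "of", tries, "\n")
cat("duplicates after dedup-first:", any(duplicated(centers)), "\n")
```

With k equal to the number of distinct rows, only a small fraction of the random draws is expected to be duplicate-free, whereas sampling from the unique rows cannot produce duplicates.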
Dear all,
Right now the user can only supply a set of centers if nstart == 1. I've modified the code so that the user can also supply a list (length == nstart) to be used as centers.
#################################################
# Check centers for sanity and consider nstart>1:
if (is.matrix(centers)) {
  # A single matrix is one set of starting points.
  if (nstart>1) {
    stop("Only one set of starting points provided, but nstart>1.\n")
  } else {
    if (any(duplicated(centers))) {
      stop("Error: if you provide centers, they had better not have duplicates.\n")
    }
    centers <- list(centers)
  }
} else if (is.list(centers)) {
  # A list must hold at least nstart sets, each free of duplicate rows.
  if (nstart > length(centers)) {
    stop("Not enough starting points provided.\n")
  } else {
    if (any(unlist(lapply(centers, function(x) { any(duplicated(x)) })))) {
      stop("Error: if you provide centers, they had better not have duplicates.\n")
    }
  }
} else {
  # Otherwise centers must be the number of clusters; pick starting points.
  if (is.numeric(centers) && length(centers)==1 && centers>0) {
    k <- centers
    centers <- getcenters(x, k, nstart)
  } else stop("centers must be a matrix/list of centers or number of clusters > 0")
}
Cheers,
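As a rough illustration of the kind of input this change is meant to accept, the sketch below builds a list of nstart center matrices from an ordinary in-memory matrix and runs the same sanity conditions on it. The data, the variable names, and the use of lapply are assumptions for the example; it does not call bigkmeans itself.

```r
set.seed(42)

# Hypothetical data: 100 points in 2 dimensions.
x <- matrix(rnorm(200), ncol = 2)
k <- 3
nstart <- 4

# Build one duplicate-free set of k centers per start.
ux <- x[!duplicated(x), , drop = FALSE]
centers <- lapply(seq_len(nstart), function(i) {
  ux[sample.int(nrow(ux), k), , drop = FALSE]
})

# The same conditions the check above enforces for a list of centers.
enough_sets <- length(centers) >= nstart
no_dups <- !any(vapply(centers, function(m) any(duplicated(m)), logical(1)))
stopifnot(is.list(centers), enough_sets, no_dups)
```

A list built this way has one k-row matrix per start, so it would pass the list branch of the check.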
The CI build dependency should be the version of bigmemory on GitHub, not CRAN. We are testing against the wrong package.
In response to this Stack Overflow question, it may be prudent to explore an mapply binding.