darios / classifyr

A framework for cross-validated classification problems, with applications to differential variability and differential distribution testing
Optimise Andy's prototype code and harmonise it into the package.
The blue and red colour scheme seems unpleasant and glaring. Change the default values of plotFeatureClasses and performancePlot to a more soothing palette. Also, change the default plot background to white with grey grid lines instead of the dreary grey that is ggplot2's default.
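A minimal sketch of the kind of change intended, assuming the plots are built with ggplot2 and that examplePlot stands in for a plot object returned by plotFeatureClasses or performancePlot (the palette values are illustrative, not proposed defaults):

library(ggplot2)
soothingPalette <- c("#4477AA", "#EE6677")  # muted blue and red, purely illustrative
examplePlot + scale_colour_manual(values = soothingPalette) +
  theme_bw()  # white background with grey grid lines instead of ggplot2's grey default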
seed <- .Random.seed[1] extracts the random number algorithm identifier, not a random number; the actual state only starts at the third element.
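For illustration, assuming the default Mersenne-Twister generator, the first element encodes the RNG kind and is identical for every seed; the state proper only begins at the third element:

set.seed(2020)
.Random.seed[1]    # RNG algorithm identifier (e.g. 10403), not a random value
.Random.seed[2:4]  # position counter, then the first values of the actual state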
Should we have a function or better documentation to make it easy to see what classifiers are available?
A better handling of the error Error in !all.equal(colnames(measurements), make.names(colnames(measurements))) : invalid argument type is needed.
When training a CoxNet model using the ClassifyR train function, the resulting model has the classes "trainedByClassifyR", "coxnet" and "glmnet". However, when predicting with this model using the predict function, I get the following error: Error in h(simpleError(msg, call)) : error in evaluating the argument 'x' in selecting a method for function 't': 'data' must be of a vector type, was 'NULL'. Removing "glmnet" from the class list seems to solve this issue, as shown below:
output = train(
    x = DataFrame(measurements),
    outcome = outcome,
    classifier = "CoxNet",
    selectionMethod = "CoxPH",
    nFeatures = 15
)
# Workaround: drop "glmnet" from the class vector before predicting.
class(output) = class(output)[1:2]
predict(output, measurements)
Prevent errors about out-of-bounds indices by replacing classes[training] with droplevels(classes[training]) in utilities.R.
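For illustration, subsetting a factor keeps its unused levels, which is what later causes the out-of-bounds indexing:

classes <- factor(c("Good", "Good", "Poor", "Intermediate"))
table(classes[1:3])              # "Intermediate" still appears, with a count of zero
table(droplevels(classes[1:3]))  # the empty level is removed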
Implement an ANOVAselection function.
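A minimal sketch of what such a function could do (the signature is hypothetical, not the package's API): rank the features of a samples-by-features matrix by one-way ANOVA p-value and keep the top-ranked ones.

ANOVAselection <- function(measurements, classes, nFeatures = 10)
{
  pValues <- apply(measurements, 2, function(featureValues)
                   anova(lm(featureValues ~ classes))[["Pr(>F)"]][1])
  names(sort(pValues))[seq_len(nFeatures)]
}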
Modularise stages into non-exported utility functions.
This will allow classification of datasets consisting of multiple assay types (e.g. mRNA, protein) but also clinical datasets which have mixed data types in the same table. Decisions required regarding how to combine multiple assays into a single table that can be processed by existing classifiers available on CRAN.
Hello, I've noticed if the rownames of the assay are not ordered the same as the names in the outcome vector, then ClassifyR will classify samples with the wrong outcome. This is particularly prominent when numeric row names are of type character where the ordering might be "1", "10", "2", "3", rather than 1, 2, 3, 10.
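A small illustration of the trap: character "numbers" sort lexicographically, so matching samples by position rather than by name silently misaligns the outcomes.

sort(c("1", "2", "3", "10"))
#> [1] "1"  "10" "2"  "3"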
I heard someone had an issue with tibbles. I don't have an example though.... Just putting this here to remind me to follow up.
Some of the summer / international exchange students are reporting problems and the errors only happen for people with MacOS and it seems to be version-specific. I don't have access to Apple computers myself.
Also, the Bioconductor check server merida1 uses MacOS Mojave 10.14.6 and doesn't have the error. Is there a modification to avoid it?
I'd love to be able to generate a plot communicating why I should choose a certain number of features in my model, i.e. performance versus the number of features. This could be used to justify using smaller models.
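A hypothetical sketch of such a plot, assuming the per-size performance values have already been collected into a data frame named performanceByN with columns nFeatures and balancedAccuracy:

library(ggplot2)
ggplot(performanceByN, aes(x = nFeatures, y = balancedAccuracy)) +
  geom_line() + geom_point() +
  labs(x = "Number of Features", y = "Balanced Accuracy")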
Provide more introduction and more targeted examples. Consider curatedTCGAdata as a motivating dataset.
Expand the predict function for crossValidate to accept more than a DataFrame
Either need to allow non-factor values, or put in an error which specifies that only a factor can be used.
Complete the implementation of crissCrossValidate prototyped by Harry using the selectMulti, train and predict functions. Harmonise the parameter naming with existing functions and test with example data.
Hi there, I'd like to use ClassifyR for survival analysis but there aren't any examples I can find. If you could help, that would be great. Thanks.
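Until a vignette example exists, a hedged sketch mirroring the survival usage that appears elsewhere in these issues (survivalData is a placeholder; the outcome is given as the names of the time and event columns):

library(ClassifyR)
result <- crossValidate(survivalData, outcome = c("time", "status"),
                        classifier = "CoxNet", selectionMethod = "CoxPH",
                        nFeatures = 15, nFolds = 5, nRepeats = 3,
                        performanceType = "C-index")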
Hi,

Would it be possible to have a feature in ClassifyR that runs many models and ranks their results based on a specified metric such as AUC or Balanced Accuracy? This would be extremely useful for those wanting to compare different models without having to run them each individually. This is similar to the AutoML feature in the H2o package. I've included an example of its output below.
measurements <- list(clinical = clin[useImages, ],
                     prop = cellProp[useImages, ],
                     dist = pairDist[useImages, ])
measurements <- lapply(measurements, as.data.frame)
measurements <- lapply(measurements, function(x) { rownames(x) <- paste0("sample", rownames(x)); x })
It would be nice if we could be a bit more flexible about all data objects in a list needing to be the same class. I'm not a fan of character numeric rownames, but arguably we could be more flexible here too.
The current list of metrics seems impressive, but there are not as many as it first appears. For example, fall (Fallout) is the same as fpr (False Positive Rate) and miss is the same as fnr (False Negative Rate). Also, these metrics are limited to two-class predictions and depend on the R package ROCR, which is no longer being maintained. Intel is developing a commercially available Data Analytics Acceleration Library with a set of metrics for multi-class classification tasks. ClassifyR should provide an R version of this.
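As one concrete example of a ROCR-free, multi-class metric, here is a minimal sketch of macro-averaged recall (balanced accuracy); the function name is illustrative:

balancedAccuracy <- function(actualClasses, predictedClasses)
{
  recalls <- sapply(levels(actualClasses), function(className)
    sum(predictedClasses == className & actualClasses == className) /
      sum(actualClasses == className))
  mean(recalls)
}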
The only annotations about a classification which can be provided are the dataset name and classification name, and they get stored as slots in the ClassifyResult object. Instead, allow a user to provide an arbitrary two-column data frame to runTests listing any characteristics, and allow plotting functions to group by any kind of characteristic.
Often, users will want to make a ROC plot and they only have class predictions by default. It's not noticeably more resource-intensive to store the scores and allow seamless AUC calculations, so change the default of returnType to "both" for prediction wrappers.
Hi Dario,

I noticed that when there are hyphens or other unexpected characters in the rownames of the input matrix, there is always an error. As the message isn't very informative, it becomes hard for users to diagnose what went wrong. One possible solution is perhaps to put in an informative error message whenever this happens? (A possible check is sketched after the reprex below.)

Cheers, Kevin.
library(randomForest)
#> randomForest 4.6-14
#> Type rfNews() to see new features/changes/bug fixes.
library(ClassifyR)
#> Loading required package: S4Vectors
#> Warning: package 'S4Vectors' was built under R version 3.6.3
#> Loading required package: stats4
#> Loading required package: BiocGenerics
#> Loading required package: parallel
#>
#> Attaching package: 'BiocGenerics'
#> The following objects are masked from 'package:parallel':
#>
#> clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
#> clusterExport, clusterMap, parApply, parCapply, parLapply,
#> parLapplyLB, parRapply, parSapply, parSapplyLB
#> The following object is masked from 'package:randomForest':
#>
#> combine
#> The following objects are masked from 'package:stats':
#>
#> IQR, mad, sd, var, xtabs
#> The following objects are masked from 'package:base':
#>
#> anyDuplicated, append, as.data.frame, basename, cbind, colnames,
#> dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
#> grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
#> order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
#> rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
#> union, unique, unsplit, which, which.max, which.min
#>
#> Attaching package: 'S4Vectors'
#> The following object is masked from 'package:base':
#>
#> expand.grid
#> Loading required package: MultiAssayExperiment
#> Warning: package 'MultiAssayExperiment' was built under R version 3.6.3
#> Loading required package: SummarizedExperiment
#> Loading required package: GenomicRanges
#> Loading required package: IRanges
#> Loading required package: GenomeInfoDb
#> Warning: package 'GenomeInfoDb' was built under R version 3.6.3
#> Loading required package: Biobase
#> Welcome to Bioconductor
#>
#> Vignettes contain introductory material; view with
#> 'browseVignettes()'. To cite Bioconductor, see
#> 'citation("Biobase")', and for packages 'citation("pkgname")'.
#> Loading required package: DelayedArray
#> Warning: package 'DelayedArray' was built under R version 3.6.3
#> Loading required package: matrixStats
#>
#> Attaching package: 'matrixStats'
#> The following objects are masked from 'package:Biobase':
#>
#> anyMissing, rowMedians
#> Loading required package: BiocParallel
#>
#> Attaching package: 'DelayedArray'
#> The following objects are masked from 'package:matrixStats':
#>
#> colMaxs, colMins, colRanges, rowMaxs, rowMins, rowRanges
#> The following objects are masked from 'package:base':
#>
#> aperm, apply, rowsum
#>
#> Attaching package: 'ClassifyR'
#> The following objects are masked from 'package:Biobase':
#>
#> featureNames, sampleNames
## From help file
# Genes 76 to 100 have differential expression.
genesMatrix <- sapply(1:25, function(sample) c(rnorm(100, 9, 2)))
genesMatrix <- cbind(genesMatrix, sapply(1:25, function(sample)
                     c(rnorm(75, 9, 2), rnorm(25, 14, 2))))
classes <- factor(rep(c("Poor", "Good"), each = 25))
colnames(genesMatrix) <- paste("Sample", 1:ncol(genesMatrix), sep = '')

## Problem if the features contain bad characters
rownames(genesMatrix) <- paste("Gene-", 1:nrow(genesMatrix), sep = '')

## Setting up runTests
trainParams <- TrainParams(randomForestTrainInterface, ntree = 500, getFeatures = forestFeatures)
predictParams <- PredictParams(randomForestPredictInterface)
params = list(trainParams, predictParams)

tmp = runTests(measurements = genesMatrix,
               classes = classes,
               datasetName = "tmp",
               classificationName = "tmp",
               permutations = 2, folds = 5,
               params = params,
               seed = 2020, verbose = 2)
#> Error: All cross-validations had an error.
Created on 2020-06-10 by the reprex package (v0.3.0)
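One way the suggested check could look (a hypothetical sketch, not current package behaviour), using the genesMatrix from the reprex above: validate the feature names up front and name the offending ones.

badNames <- rownames(genesMatrix)[rownames(genesMatrix) != make.names(rownames(genesMatrix))]
if(length(badNames) > 0)
  stop("Feature names must be syntactically valid R names. Offending names include: ",
       paste(head(badNames), collapse = ", "))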
Classifier-specific methods such as class weights (e.g. SVM, random forest) were found to work poorly, and many people have mentioned the same on Stack Exchange. By default, enable upsampling to the number of samples of the largest class.
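A minimal sketch of the upsampling idea (a hypothetical helper, not the package's internals): resample every class with replacement up to the size of the largest class and return the training indices to use.

upsampleToLargest <- function(classes)
{
  largest <- max(table(classes))
  unlist(lapply(levels(classes), function(className)
  {
    classIndices <- which(classes == className)
    extra <- classIndices[sample.int(length(classIndices), largest - length(classIndices), replace = TRUE)]
    c(classIndices, extra)
  }))
}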
Sarah Romanes and John Ormerod implemented multi-class linear/quadratic discriminant analysis. Improve the code so that it uses loops instead of copying and pasting the same command multiple times with different indices. Remove all representation of classes as 0 or 1 indicators (i.e. n = sum(vy1 + vy2 + vy3) can be n = length(samplesClasses)) and ensure the classifier works with a single factor variable. Convert end-user functions into S4 functions. Use :: notation for accessing functions from other packages and import them in the NAMESPACE file. Also, check for variables that are calculated but never used in any calculations, such as Sx1 = t(mX)%*%(vy1+vy2+vy3) and Sx2 = t(mX2)%*%(vy1+vy2+vy3). Are they necessary for a correct calculation or are they left over from an abandoned idea? Create documentation for all end-user facing functions.
With the large number of options, it becomes hard to remember what the expected inputs and outputs are. A manual that explains the parameter requirements, purpose of internal functions and shows some basic test cases will be helpful when editing the code in future.
It would be great not to have to write out performancePlot(res, characteristicsList = list(x = "Assay Name"), performanceName = "C-index") in full.
Doing prevalidation with crossValidate stops with an error about the function prevalTrainInterface not being defined. It should be caught before Error in fullResult$models : $ operator is invalid for atomic vectors. Reported by Farhan.
Provide more alternatives.
Add new wrapper functions for DLDA, SVM, Random Forest, Logistic Regression, and Elastic-Net Regularised GLM classification algorithms.
The code below is producing an error for Yunwei.
result <- crossValidate(temp_data, c("time", "status"), classifier = "randomSurvivalForest",
                        nFolds = 5, nRepeats = 3, performanceType = "C-index", selectionMethod = "CoxPH")
An error message is needed for scenarios where the input data contains NA values.
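A minimal sketch of the kind of early check intended (the wording is illustrative):

if(any(is.na(measurements)))
  stop("The input data contains missing (NA) values. ",
       "Please impute or remove them before running cross-validation.")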
Some usage sections and examples in the Rd files have a long line length and exceed the page width. Also, the vignette page width option is being ignored. Ensure that it is being compiled with R Markdown 2 and also wrap the lines nicely.
I've created an example using the iris data. It shows that when each measurement assay contains only one feature, the crossValidate wrapper returns an error. The error is different when multiViewMethod = "merge" than when no multi-view method is used.
library(dplyr)

irisFiltered = iris %>% filter(Species %in% c("setosa", "versicolor"))
rownames(irisFiltered) = make.names(rownames(irisFiltered))
# Split the columns into a list of single-column data frames (one assay per feature).
measurementsData = irisFiltered %>% select(-Species) %>% apply(FUN = function(x) x, ., MARGIN = 2, simplify = FALSE) %>% lapply(data.frame)
outcome = irisFiltered$Species
names(outcome) = rownames(irisFiltered)

cvResults = ClassifyR::crossValidate(measurements = measurementsData, outcome = outcome, nFeatures = 1, selectionMethod = 't-test', classifier = "GLM", nRepeats = 1, nFolds = 3, nCores = 1)
cvResults = ClassifyR::crossValidate(measurements = measurementsData, outcome = outcome, nFeatures = 1, selectionMethod = 't-test', classifier = "GLM", nRepeats = 1, nFolds = 3, nCores = 1, multiViewMethod = "merge")
This broke when variables had zero variability. I need to add some filtering here in a sensible way.
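A minimal sketch of one sensible filter, assuming a samples-by-features matrix named measurements: drop features whose variance is zero before selection and training.

featureVariances <- apply(measurements, 2, var, na.rm = TRUE)
measurementsFiltered <- measurements[, featureVariances > 0, drop = FALSE]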
It would be good to have a train() function to complement predict() for situations where people don't need to run crossValidate. Effectively, this would take most of the same input as crossValidate() and then do a runTest. One key distinction from crossValidate would be that, for multiViewMethod, it only fits one model. So, by default, this should be the full model unless they specify an assayCombination.
It would be great to auto-select the colours for ROCplot so the default wasn't Classifier Name. Perhaps if comparison is not provided and there is no characteristicsLabel, it could automatically paste together the non-unique Characteristics values?
> head(chosenFeatureNames(coxPrevalPredicts[[50]]))
[[1]]
NULL
[[2]]
NULL
[[3]]
NULL
[[4]]
NULL
[[5]]
NULL
[[6]]
NULL
This happens for crossValidate.
Recently, a two-stage classifier which combines clinical data with omics data was shown to reduce classification error rates greatly. Implement the idea in a generalisable way that works on mixed clinical data types and with user-specified feature selection and classification.
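A hypothetical sketch of the two-stage idea (the function and inputs are illustrative, not the published method): fit a clinical-only model first, then pass its predicted risk score to the omics model as an additional feature.

library(randomForest)
twoStageTrain <- function(clinical, omics, classes)
{
  clinicalModel <- glm(classes ~ ., data = clinical, family = binomial)
  clinicalScore <- predict(clinicalModel, clinical, type = "response")
  omicsModel <- randomForest(x = cbind(omics, clinicalScore = clinicalScore), y = classes)
  list(clinicalModel = clinicalModel, omicsModel = omicsModel)
}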
The code cbind(predictions(classified), known = actualClasses(classified)) should be cbind(predictions(classified)[[1]], known = actualClasses(classified)) to give nice output. Change in both stable and experimental versions.
AUC can be calculated by merging all predictions from all iterations, as is done currently, but for large sample size data sets it would be slightly better to calculate the AUC at each fold separately and then average the values. Also, move the AUC performance calculation code from ROCplot.R into calcPerformance.R.
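A minimal sketch of the fold-wise alternative, using the rank-based AUC estimator (a hypothetical helper; the second factor level is assumed to be the positive class):

aucPerFold <- function(scores, classes, folds)
{
  sapply(split(seq_along(scores), folds), function(foldIndices)
  {
    positives <- scores[foldIndices][classes[foldIndices] == levels(classes)[2]]
    negatives <- scores[foldIndices][classes[foldIndices] == levels(classes)[1]]
    # Folds lacking one of the classes produce NaN and could be skipped.
    mean(outer(positives, negatives, ">") + 0.5 * outer(positives, negatives, "=="))
  })
}
# mean(aucPerFold(scores, classes, folds)) is then the fold-averaged AUC.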
It chose characteristicsLabel, which had the same value for both results, but it should have chosen Classifier Name, which had two values.
If nFeatures is a range of numbers and multiViewMethod is "merge", then a subscript out of bounds error occurs.
Running these lines of code spits out the following error:

Error in .doSelection(measurementsUse, outcomeTrain, CrossValParams(), : trying to get slot "tuneParams" from an object of a basic class ("NULL") with no slots

classifier = "CoxNet"
selectionMethod = "none"
output = train(
    x = DataFrame(measurements),
    outcome = outcome,
    classifier = classifier,
    selectionMethod = selectionMethod
)