darios / classifyr

A framework for cross-validated classification problems, with applications to differential variability and differential distribution testing
Optimise Andy's prototype code and harmonise it into the package.
The blue and red colour scheme seems unpleasant and glaring. Change the default values of plotFeatureClasses and performancePlot to a more soothing palette. Also, change the default plot background to white with grey grid lines instead of the dreary grey that is ggplot2's default.
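A minimal sketch of the kind of change intended, assuming the plots are built with ggplot2 and that examplePlot stands in for a plot object returned by plotFeatureClasses or performancePlot (the palette values are illustrative, not proposed defaults):

library(ggplot2)
soothingPalette <- c("#4477AA", "#EE6677")  # muted blue and red, purely illustrative
examplePlot + scale_colour_manual(values = soothingPalette) +
  theme_bw()  # white background with grey grid lines instead of ggplot2's grey default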
seed <- .Random.seed[1] extracts the random number algorithm identifier, not a random number; the actual state only starts at the third element.
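For illustration, assuming the default Mersenne-Twister generator, the first element encodes the RNG kind and is identical for every seed; the state proper only begins at the third element:

set.seed(2020)
.Random.seed[1]    # RNG algorithm identifier (e.g. 10403), not a random value
.Random.seed[2:4]  # position counter, then the first values of the actual state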
Should we have a function or better documentation to make it easy to see what classifiers are available?
A better handling of the error Error in !all.equal(colnames(measurements), make.names(colnames(measurements))) : invalid argument type is needed.
When training a CoxNet model using the ClassifyR train function, the resulting model has the classes "trainedByClassifyR", "coxnet" and "glmnet". However, when predicting with this model using the predict function, I get the following error: Error in h(simpleError(msg, call)) : error in evaluating the argument 'x' in selecting a method for function 't': 'data' must be of a vector type, was 'NULL'. Removing "glmnet" from the class list seems to solve this issue, as shown below:
output = train(
    x = DataFrame(measurements),
    outcome = outcome,
    classifier = "CoxNet",
    selectionMethod = "CoxPH",
    nFeatures = 15
)
# Workaround: drop "glmnet" from the class vector before predicting.
class(output) = class(output)[1:2]
predict(output, measurements)
Prevent errors about out-of-bounds indices by replacing classes[training] with droplevels(classes[training]) in utilities.R.
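For illustration, subsetting a factor keeps its unused levels, which is what later causes the out-of-bounds indexing:

classes <- factor(c("Good", "Good", "Poor", "Intermediate"))
table(classes[1:3])              # "Intermediate" still appears, with a count of zero
table(droplevels(classes[1:3]))  # the empty level is removed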
Implement an ANOVAselection function.
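A minimal sketch of what such a function could do (the signature is hypothetical, not the package's API): rank the features of a samples-by-features matrix by one-way ANOVA p-value and keep the top-ranked ones.

ANOVAselection <- function(measurements, classes, nFeatures = 10)
{
  pValues <- apply(measurements, 2, function(featureValues)
                   anova(lm(featureValues ~ classes))[["Pr(>F)"]][1])
  names(sort(pValues))[seq_len(nFeatures)]
}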
Modularise stages into non-exported utility functions.
This will allow classification of datasets consisting of multiple assay types (e.g. mRNA, protein) but also clinical datasets which have mixed data types in the same table. Decisions required regarding how to combine multiple assays into a single table that can be processed by existing classifiers available on CRAN.
Hello, I've noticed if the rownames of the assay are not ordered the same as the names in the outcome vector, then ClassifyR will classify samples with the wrong outcome. This is particularly prominent when numeric row names are of type character where the ordering might be "1", "10", "2", "3", rather than 1, 2, 3, 10.
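A small illustration of the trap: character "numbers" sort lexicographically, so matching samples by position rather than by name silently misaligns the outcomes.

sort(c("1", "2", "3", "10"))
#> [1] "1"  "10" "2"  "3"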
I heard someone had an issue with tibbles. I don't have an example though.... Just putting this here to remind me to follow up.
Some of the summer / international exchange students are reporting problems and the errors only happen for people with MacOS and it seems to be version-specific. I don't have access to Apple computers myself.
Also, the Bioconductor check server merida1 uses MacOS Mojave 10.14.6 and doesn't have the error. Is there a modification to avoid it?
I'd love to be able to generate a plot communicating why I should choose a certain number of features in my model, i.e. performance versus the number of features. This could be used to justify using smaller models.
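A hypothetical sketch of such a plot, assuming the per-size performance values have already been collected into a data frame named performanceByN with columns nFeatures and balancedAccuracy:

library(ggplot2)
ggplot(performanceByN, aes(x = nFeatures, y = balancedAccuracy)) +
  geom_line() + geom_point() +
  labs(x = "Number of Features", y = "Balanced Accuracy")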
Provide more introduction and more targeted examples. Consider curatedTCGAdata as a motivating dataset.
Expand the predict function for crossValidate to accept more than a DataFrame
Either need to allow non-factor values, or put in an error which specifies that only a factor can be used.
Complete the implementation of crissCrossValidate prototyped by Harry using the selectMulti, train and predict functions. Harmonise the parameter naming with existing functions and test with example data.
Hi there, I'd like to use ClassifyR for survival analysis but there aren't any examples I can find. If you could help, that would be great. Thanks.
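Until a vignette example exists, a hedged sketch mirroring the survival usage that appears elsewhere in these issues (survivalData is a placeholder; the outcome is given as the names of the time and event columns):

library(ClassifyR)
result <- crossValidate(survivalData, outcome = c("time", "status"),
                        classifier = "CoxNet", selectionMethod = "CoxPH",
                        nFeatures = 15, nFolds = 5, nRepeats = 3,
                        performanceType = "C-index")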
Hi,

Would it be possible to have a feature in ClassifyR that runs many models and ranks their results based on a specified metric such as AUC or Balanced Accuracy? This would be extremely useful for those wanting to compare different models without having to run them each individually. This is similar to the AutoML feature in the H2o package. I've included an example of its output below.
measurements <- list(clinical = clin[useImages, ],
                     prop = cellProp[useImages, ],
                     dist = pairDist[useImages, ])
measurements <- lapply(measurements, as.data.frame)
measurements <- lapply(measurements, function(x) { rownames(x) <- paste0("sample", rownames(x)); x })
It would be nice if we could be a bit more flexible about all data objects in a list needing to be the same class. I'm not a fan of character numeric rownames, but arguably we could be more flexible here too.
The current list of metrics seems impressive, but there are not as many as it first appears. For example, fall (Fallout) is the same as fpr (False Positive Rate) and miss is the same as fnr (False Negative Rate). Also, these metrics are limited to two-class predictions and depend on the R package ROCR, which is no longer being maintained. Intel is developing a commercially available Data Analytics Acceleration Library with a set of metrics for multi-class classification tasks. ClassifyR should provide an R version of this.
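As one concrete example of a ROCR-free, multi-class metric, here is a minimal sketch of macro-averaged recall (balanced accuracy); the function name is illustrative:

balancedAccuracy <- function(actualClasses, predictedClasses)
{
  recalls <- sapply(levels(actualClasses), function(className)
    sum(predictedClasses == className & actualClasses == className) /
      sum(actualClasses == className))
  mean(recalls)
}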
The only annotations about a classification which can be provided are the dataset name and classification name, and they get stored as slots in the ClassifyResult object. Instead, allow a user to provide an arbitrary two-column data frame to runTests listing any characteristics, and allow plotting functions to group by any kind of characteristic.
Often, users will want to make a ROC plot and they only have class predictions by default. It's not noticeably more resource-intensive to store the scores and allow seamless AUC calculations, so change the default of returnType to "both" for prediction wrappers.
Hi Dario,

I noticed that when there are hyphens or other unexpected characters in the rownames of the input matrix, there is always an error. As the message isn't very informative, it becomes hard for users to diagnose what went wrong. One possible solution is perhaps to put in an informative error message whenever this happens? (A possible check is sketched after the reprex below.)

Cheers, Kevin.
library(randomForest)
#> randomForest 4.6-14
#> Type rfNews() to see new features/changes/bug fixes.
library(ClassifyR)
#> Loading required package: S4Vectors
#> Warning: package 'S4Vectors' was built under R version 3.6.3
#> Loading required package: stats4
#> Loading required package: BiocGenerics
#> Loading required package: parallel
#>
#> Attaching package: 'BiocGenerics'
#> The following objects are masked from 'package:parallel':
#>
#> clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
#> clusterExport, clusterMap, parApply, parCapply, parLapply,
#> parLapplyLB, parRapply, parSapply, parSapplyLB
#> The following object is masked from 'package:randomForest':
#>
#> combine
#> The following objects are masked from 'package:stats':
#>
#> IQR, mad, sd, var, xtabs
#> The following objects are masked from 'package:base':
#>
#> anyDuplicated, append, as.data.frame, basename, cbind, colnames,
#> dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
#> grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
#> order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
#> rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
#> union, unique, unsplit, which, which.max, which.min
#>
#> Attaching package: 'S4Vectors'
#> The following object is masked from 'package:base':
#>
#> expand.grid
#> Loading required package: MultiAssayExperiment
#> Warning: package 'MultiAssayExperiment' was built under R version 3.6.3
#> Loading required package: SummarizedExperiment
#> Loading required package: GenomicRanges
#> Loading required package: IRanges
#> Loading required package: GenomeInfoDb
#> Warning: package 'GenomeInfoDb' was built under R version 3.6.3
#> Loading required package: Biobase
#> Welcome to Bioconductor
#>
#> Vignettes contain introductory material; view with
#> 'browseVignettes()'. To cite Bioconductor, see
#> 'citation("Biobase")', and for packages 'citation("pkgname")'.
#> Loading required package: DelayedArray
#> Warning: package 'DelayedArray' was built under R version 3.6.3
#> Loading required package: matrixStats
#>
#> Attaching package: 'matrixStats'
#> The following objects are masked from 'package:Biobase':
#>
#> anyMissing, rowMedians
#> Loading required package: BiocParallel
#>
#> Attaching package: 'DelayedArray'
#> The following objects are masked from 'package:matrixStats':
#>
#> colMaxs, colMins, colRanges, rowMaxs, rowMins, rowRanges
#> The following objects are masked from 'package:base':
#>
#> aperm, apply, rowsum
#>
#> Attaching package: 'ClassifyR'
#> The following objects are masked from 'package:Biobase':
#>
#> featureNames, sampleNames
## From help file
# Genes 76 to 100 have differential expression.
genesMatrix <- sapply(1:25, function(sample) c(rnorm(100, 9, 2)))
genesMatrix <- cbind(genesMatrix, sapply(1:25, function(sample)
                     c(rnorm(75, 9, 2), rnorm(25, 14, 2))))
classes <- factor(rep(c("Poor", "Good"), each = 25))
colnames(genesMatrix) <- paste("Sample", 1:ncol(genesMatrix), sep = '')

## Problem if the features contain bad characters
rownames(genesMatrix) <- paste("Gene-", 1:nrow(genesMatrix), sep = '')

## Setting up runTests
trainParams <- TrainParams(randomForestTrainInterface, ntree = 500, getFeatures = forestFeatures)
predictParams <- PredictParams(randomForestPredictInterface)
params = list(trainParams, predictParams)

tmp = runTests(measurements = genesMatrix,
               classes = classes,
               datasetName = "tmp",
               classificationName = "tmp",
               permutations = 2, folds = 5,
               params = params,
               seed = 2020, verbose = 2)
#> Error: All cross-validations had an error.
Created on 2020-06-10 by the reprex package (v0.3.0)
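One way the suggested check could look (a hypothetical sketch, not current package behaviour), using the genesMatrix from the reprex above: validate the feature names up front and name the offending ones.

badNames <- rownames(genesMatrix)[rownames(genesMatrix) != make.names(rownames(genesMatrix))]
if(length(badNames) > 0)
  stop("Feature names must be syntactically valid R names. Offending names include: ",
       paste(head(badNames), collapse = ", "))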
Classifier-specific methods such as class weights (e.g. SVM, random forest) were found to work poorly, and many people have mentioned the same on Stack Exchange. By default, enable upsampling to the number of samples of the largest class.
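A minimal sketch of the upsampling idea (a hypothetical helper, not the package's internals): resample every class with replacement up to the size of the largest class and return the training indices to use.

upsampleToLargest <- function(classes)
{
  largest <- max(table(classes))
  unlist(lapply(levels(classes), function(className)
  {
    classIndices <- which(classes == className)
    extra <- classIndices[sample.int(length(classIndices), largest - length(classIndices), replace = TRUE)]
    c(classIndices, extra)
  }))
}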
Sarah Romanes and John Ormerod implemented multi-class linear/quadratic discriminant analysis. Improve the code so that it uses loops instead of copying and pasting the same command multiple times with different indices. Remove all representation of classes as 0 or 1 indicators (i.e. n = sum(vy1 + vy2 + vy3) can be n = length(samplesClasses)) and ensure the classifier works with a single factor variable. Convert end-user functions into S4 functions. Use :: notation for accessing functions from other packages and import them in the NAMESPACE file. Also, check for variables that are calculated but never used in any calculations, such as Sx1 = t(mX)%*%(vy1+vy2+vy3) and Sx2 = t(mX2)%*%(vy1+vy2+vy3). Are they necessary for a correct calculation or are they left over from an abandoned idea? Create documentation for all end-user facing functions.
With the large number of options, it becomes hard to remember what the expected inputs and outputs are. A manual that explains the parameter requirements, purpose of internal functions and shows some basic test cases will be helpful when editing the code in future.
It would be great not to have to write out performancePlot(res, characteristicsList = list(x = "Assay Name"), performanceName = "C-index") in full.
Doing prevalidation with crossValidate stops with an error about the function prevalTrainInterface not being defined. It should be caught before Error in fullResult$models : $ operator is invalid for atomic vectors. Reported by Farhan.
Provide more alternatives.
Add new wrapper functions for DLDA, SVM, Random Forest, Logistic Regression, and Elastic-Net Regularised GLM classification algorithms.
The code below is producing an error for Yunwei.
result <- crossValidate(temp_data, c("time", "status"), classifier = "randomSurvivalForest",
                        nFolds = 5, nRepeats = 3, performanceType = "C-index", selectionMethod = "CoxPH")
An error message is needed for scenarios where the input data contains NA values.
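A minimal sketch of the kind of early check intended (the wording is illustrative):

if(any(is.na(measurements)))
  stop("The input data contains missing (NA) values. ",
       "Please impute or remove them before running cross-validation.")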
Some usage sections and examples in the Rd files have a long line length and exceed the page width. Also, the vignette page width option is being ignored. Ensure that it is being compiled with R Markdown 2 and also wrap the lines nicely.
I've created an example using the iris data. It shows that when each measurement assay contains only one feature, the crossValidate wrapper returns an error. The error is different when multiViewMethod = "merge" than when no multi-view method is used.
library(dplyr)

irisFiltered = iris %>% filter(Species %in% c("setosa", "versicolor"))
rownames(irisFiltered) = make.names(rownames(irisFiltered))
# Split the columns into a list of single-column data frames (one assay per feature).
measurementsData = irisFiltered %>% select(-Species) %>% apply(FUN = function(x) x, ., MARGIN = 2, simplify = FALSE) %>% lapply(data.frame)
outcome = irisFiltered$Species
names(outcome) = rownames(irisFiltered)

cvResults = ClassifyR::crossValidate(measurements = measurementsData, outcome = outcome, nFeatures = 1, selectionMethod = 't-test', classifier = "GLM", nRepeats = 1, nFolds = 3, nCores = 1)
cvResults = ClassifyR::crossValidate(measurements = measurementsData, outcome = outcome, nFeatures = 1, selectionMethod = 't-test', classifier = "GLM", nRepeats = 1, nFolds = 3, nCores = 1, multiViewMethod = "merge")
This broke when variables had zero variability. I need to add some filtering here in a sensible way.
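A minimal sketch of one sensible filter, assuming a samples-by-features matrix named measurements: drop features whose variance is zero before selection and training.

featureVariances <- apply(measurements, 2, var, na.rm = TRUE)
measurementsFiltered <- measurements[, featureVariances > 0, drop = FALSE]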
It would be good to have a train() function to complement predict() for situations where people don't need to run crossValidate. Effectively, this would take most of the same input as crossValidate() and then do a runTest. One key distinction from crossValidate would be that, for multiViewMethod, it only fits one model. So, by default, this should be the full model unless they specify an assayCombination.
It would be great to auto-select the colours for ROCplot so the default wasn't Classifier Name. Perhaps if comparison is not provided and there is no characteristicsLabel, it could automatically paste together the non-unique Characteristics values?
> head(chosenFeatureNames(coxPrevalPredicts[[50]]))
[[1]]
NULL
[[2]]
NULL
[[3]]
NULL
[[4]]
NULL
[[5]]
NULL
[[6]]
NULL
This happens for crossValidate.
Recently, a two-stage classifier which combines clinical data with omics data was shown to reduce classification error rates greatly. Implement the idea in a generalisable way that works on mixed clinical data types and with user-specified feature selection and classification.
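A hypothetical sketch of the two-stage idea (the function and inputs are illustrative, not the published method): fit a clinical-only model first, then pass its predicted risk score to the omics model as an additional feature.

library(randomForest)
twoStageTrain <- function(clinical, omics, classes)
{
  clinicalModel <- glm(classes ~ ., data = clinical, family = binomial)
  clinicalScore <- predict(clinicalModel, clinical, type = "response")
  omicsModel <- randomForest(x = cbind(omics, clinicalScore = clinicalScore), y = classes)
  list(clinicalModel = clinicalModel, omicsModel = omicsModel)
}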
The code cbind(predictions(classified), known = actualClasses(classified)) should be cbind(predictions(classified)[[1]], known = actualClasses(classified)) to give nice output. Change in both stable and experimental versions.
AUC can be calculated by merging all predictions from all iterations, as is done currently, but for large sample size data sets it would be slightly better to calculate the AUC at each fold separately and then average the values. Also, move the AUC performance calculation code from ROCplot.R into calcPerformance.R.
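A minimal sketch of the fold-wise alternative, using the rank-based AUC estimator (a hypothetical helper; the second factor level is assumed to be the positive class):

aucPerFold <- function(scores, classes, folds)
{
  sapply(split(seq_along(scores), folds), function(foldIndices)
  {
    positives <- scores[foldIndices][classes[foldIndices] == levels(classes)[2]]
    negatives <- scores[foldIndices][classes[foldIndices] == levels(classes)[1]]
    # Folds lacking one of the classes produce NaN and could be skipped.
    mean(outer(positives, negatives, ">") + 0.5 * outer(positives, negatives, "=="))
  })
}
# mean(aucPerFold(scores, classes, folds)) is then the fold-averaged AUC.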
It chose characteristicsLabel, which had the same value for both results, but it should have chosen Classifier Name, which had two values.
If nFeatures is a range of numbers and multiViewMethod is "merge", then a subscript out of bounds error occurs.
Running these lines of code spits out the following error:

Error in .doSelection(measurementsUse, outcomeTrain, CrossValParams(), : trying to get slot "tuneParams" from an object of a basic class ("NULL") with no slots

classifier = "CoxNet"
selectionMethod = "none"
output = train(
    x = DataFrame(measurements),
    outcome = outcome,
    classifier = classifier,
    selectionMethod = selectionMethod
)