classifyr's People

Contributors

darios, dtenenba, ellispatrick, hpages, jwokaty, nick-robo, nturaga, vobencha


classifyr's Issues

Change Default Colour Scheme

The blue and red colour scheme seems unpleasant and glaring. Change the default values of plotFeatureClasses and performancePlot to a more soothing palette.

Also, change default plot background to white with grey grid lines instead of the dreary grey, which is ggplot2's default.

colname error

Handle the following error more gracefully, so the cause is clearer to the user: Error in !all.equal(colnames(measurements), make.names(colnames(measurements))) :
invalid argument type

Class list for CoxNet models, affecting the predict function.

When training a CoxNet model using the ClassifyR train function, the resulting model has the class "trainedByClassifyR", "coxnet", "glmnet". However, when predicting with this model using the predict function, I get the following error: Error in h(simpleError(msg, call)) : error in evaluating the argument 'x' in selecting a method for function 't': 'data' must be of a vector type, was 'NULL'.

Removing "glmnet" from the class list seems to solve this issue. As shown below:

output = train(
  x = DataFrame(measurements),
  outcome = outcome,
  classifier = "CoxNet",
  selectionMethod  = "CoxPH",
  nFeatures = 15
)

class(output) = class(output)[1:2]

predict(output, measurements)

Replace ExpressionSet Data Storage by MultiAssayExperiment

This will allow classification of datasets consisting of multiple assay types (e.g. mRNA, protein) but also clinical datasets which have mixed data types in the same table. Decisions required regarding how to combine multiple assays into a single table that can be processed by existing classifiers available on CRAN.

Cross validate doesn't match up assay row names to names of outcome vector.

Hello, I've noticed if the rownames of the assay are not ordered the same as the names in the outcome vector, then ClassifyR will classify samples with the wrong outcome. This is particularly prominent when numeric row names are of type character where the ordering might be "1", "10", "2", "3", rather than 1, 2, 3, 10.
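A minimal sketch of the pitfall and a defensive fix on the user side (all object names here are illustrative, not ClassifyR internals):

```r
# Character row names sort lexicographically, so "10" comes before "2".
assay <- matrix(rnorm(8), nrow = 4,
                dimnames = list(c("1", "10", "2", "3"), c("geneA", "geneB")))
outcome <- c("1" = "Good", "2" = "Good", "3" = "Poor", "10" = "Poor")

# Positional matching silently pairs samples with the wrong outcome here.
# Explicitly reordering the outcome by the assay's row names avoids that.
outcome <- outcome[rownames(assay)]
stopifnot(identical(names(outcome), rownames(assay)))
```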

Tibbles

I heard someone had an issue with tibbles. I don't have an example, though. Just putting this here to remind me to follow up.

First Cross-validation Example in Vignette Fails on Some Versions of MacOS

Some of the summer / international exchange students are reporting problems and the errors only happen for people with MacOS and it seems to be version-specific. I don't have access to Apple computers myself.

  • Christine has MacOS Monterey 12.6 and doesn't have the error.
  • Cabiria has MacOS Monterey 12.6.2 and has the error.
  • Ian has MacOS Ventura 13.0.1 and has the error. Connecting to his department's Linux server in Hong Kong is problem-free.

Also, the Bioconductor check server merida1 uses MacOS Mojave 10.14.6 and doesn't have the error. Is there a modification to avoid it?

Pull out tuning results

I'd love to be able to generate a plot communicating why I should choose a certain number of features in my model.

i.e. performance vs. number of features.

This could be used to justify using smaller models.

Integrate crissCrossValidate

Complete implementation of crissCrossValidate prototyped by Harry using selectMulti, train and predict functions. Harmonise the parameter naming with existing functions and test with example data.

Model Comparison Feature

Hi,
Would it be possible to have a feature in ClassifyR that runs many models and ranks their results based on a specified metric, such as AUC or Balanced Accuracy? This would be extremely useful for those wanting to compare different models without having to run each of them individually. It is similar to the AutoML feature in the H2O package. I've included an example of its output below.

(screenshot of the H2O AutoML leaderboard output omitted)

Data formatting for crossValidate

measurements <- list(clinical = clin[useImages,],
prop = cellProp[useImages,],
dist = pairDist[useImages,])

measurements <- lapply(measurements, as.data.frame)
measurements <- lapply(measurements, function(x){rownames(x) <- paste0("sample",rownames(x));x})

It would be nice to be more flexible about requiring all data objects in a list to be of the same class. I'm not a fan of character numeric rownames, but arguably we could be more flexible there too...

Replace Performance Metrics by Intel DAAL Metrics

The current list of metrics seems impressive, but there are not as many as it first appears. For example, fall (Fallout) is the same as fpr (False Positive Rate), and miss is the same as fnr (False Negative Rate). Also, these metrics are limited to two-class predictions and depend on the R package ROCR, which is no longer maintained. Intel is developing a commercially available Data Analytics Acceleration Library with a set of metrics for multi-class classification tasks. ClassifyR should provide an R version of these.

Classifier Characteristics Limited to Dataset and Classification Name

The only annotations which can be provided about a classification are the dataset name and classification name, which are stored as slots in the ClassifyResult object. Instead, allow a user to provide an arbitrary two-column data frame to runTests listing any characteristics, and allow plotting functions to group by any kind of characteristic.

Classifier Return Type Default is Categorical Class

Often, users will want to make a ROC plot and they only have class predictions by default. It's not noticeably more resource-intensive to store the scores and allow seamless AUC calculations, so change the default of returnType to "both" for prediction wrappers.

Dealing with ill-formatted characters

Hi Dario,
I noticed that when there are hyphens or other unexpected characters in the rownames of the input matrix, there is always an error. As the message isn't very informative, it becomes hard for users to diagnose what went wrong. One possible solution is perhaps to emit an informative error message whenever this happens?

Cheers, Kevin.

library(randomForest)
#> randomForest 4.6-14
#> Type rfNews() to see new features/changes/bug fixes.
library(ClassifyR)
#> (startup and masking messages from S4Vectors, BiocGenerics, Biobase,
#> MultiAssayExperiment and other dependencies omitted; several packages
#> warn that they were built under R version 3.6.3)
## From help file
# Genes 76 to 100 have differential expression.
genesMatrix <- sapply(1:25, function(sample) c(rnorm(100, 9, 2)))
genesMatrix <- cbind(genesMatrix, sapply(1:25, function(sample)
  c(rnorm(75, 9, 2), rnorm(25, 14, 2))))
classes <- factor(rep(c("Poor", "Good"), each = 25))
colnames(genesMatrix) <- paste("Sample", 1:ncol(genesMatrix), sep = '')

## Problem if the features contain bad characters
rownames(genesMatrix) <- paste("Gene-", 1:nrow(genesMatrix), sep = '')

## Setting up runTests
trainParams <- TrainParams(randomForestTrainInterface, ntree = 500, getFeatures = forestFeatures)
predictParams <- PredictParams(randomForestPredictInterface)
params = list(trainParams, predictParams)

tmp = runTests(measurements = genesMatrix,
               classes = classes,
               datasetName = "tmp",
               classificationName = "tmp",
               permutations = 2, folds = 5,
               params = params,
               seed = 2020, verbose = 2)
#> Error: All cross-validations had an error.

Created on 2020-06-10 by the reprex package (v0.3.0)
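Until a clearer error message exists, one user-side workaround is to sanitise the feature names before running ClassifyR; make.names() replaces characters that are not syntactically valid in R names:

```r
# Hyphens, spaces and other special characters are replaced with dots.
badNames <- c("Gene-1", "Gene 2", "Gene+3")
fixedNames <- make.names(badNames)
fixedNames
#> [1] "Gene.1" "Gene.2" "Gene.3"

# Applied to the reprex above:
# rownames(genesMatrix) <- make.names(rownames(genesMatrix))
```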

Class Imbalance Sampling

Classifier-specific methods such as class weights (e.g. SVM, random forest) were found to work poorly, and many people have mentioned the same on Stack Exchange. By default, enable upsampling to the number of samples of the largest class.
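A sketch of what the default could look like; the function name and details are hypothetical, not ClassifyR's implementation:

```r
# Resample each class (with replacement) up to the size of the largest class.
upsampleIndices <- function(classes) {
  largest <- max(table(classes))
  unlist(lapply(levels(classes), function(cl) {
    inClass <- which(classes == cl)
    c(inClass, sample(inClass, largest - length(inClass), replace = TRUE))
  }))
}

classes <- factor(c(rep("Good", 8), rep("Poor", 2)))
table(classes[upsampleIndices(classes)])  # both classes now have 8 samples
```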

Integrate Multi-class Discriminant Analysis Classifier

Sarah Romanes and John Ormerod implemented multi-class linear/quadratic discriminant analysis. Improve the code so that it is ready for integration:

  • Use loops instead of copying and pasting the same command multiple times with different indices.
  • Remove all representation of classes as 0 or 1 indicators (i.e. n = sum(vy1 + vy2 + vy3) can be n = length(samplesClasses)) and ensure the classifier works with a single factor variable.
  • Convert end-user functions into S4 functions.
  • Use :: notation for accessing functions from other packages and import them in the NAMESPACE file.
  • Check for variables that are calculated but never used in any calculation, such as Sx1 = t(mX)%*%(vy1+vy2+vy3) and Sx2 = t(mX2)%*%(vy1+vy2+vy3). Are they necessary for a correct calculation, or are they left over from an abandoned idea?
  • Create documentation for all end-user facing functions.

Create Wrappers for Classifiers

Add new wrapper functions for DLDA, SVM, Random Forest, Logistic Regression, and Elastic-Net Regularised GLM classification algorithms.

Survival Data Subscript Out of Bounds Error

The code below is producing an error for Yunwei.

result <- crossValidate(temp_data, c("time", "status"), classifier = "randomSurvivalForest", nFolds = 5,
                        nRepeats = 3, performanceType = "C-index", selectionMethod = "CoxPH")

Reformat Wide Documentation Code

Some usage sections and examples in the Rd files have a long line length and exceed the page width. Also, the vignette page width option is being ignored. Ensure that it is being compiled with R Markdown 2 and also wrap the lines nicely.

Issues when only 1 feature is provided to the crossValidate wrapper

I've created an example using the iris data. It shows that when the measurement assays contain only one feature, the crossValidate wrapper returns an error. There is a different error when multiViewMethod = "merge" than when no multi-view method is used.

Setting up the data

library(dplyr)  # for %>%, filter and select

irisFiltered = iris %>% filter(Species %in% c("setosa", "versicolor"))

rownames(irisFiltered) = make.names(rownames(irisFiltered))

measurementsData = irisFiltered %>%
  select(-Species) %>%
  apply(FUN = function(x) x, ., MARGIN = 2, simplify = FALSE) %>%
  lapply(data.frame)

outcome = irisFiltered$Species
names(outcome) = rownames(irisFiltered)

CrossValidate issue without merge multiview

cvResults = ClassifyR::crossValidate(measurements = measurementsData, outcome = outcome,
                                     nFeatures = 1, selectionMethod = 't-test', classifier = "GLM",
                                     nRepeats = 1, nFolds = 3, nCores = 1)

CrossValidate issue with merge multiview

cvResults = ClassifyR::crossValidate(measurements = measurementsData, outcome = outcome,
                                     nFeatures = 1, selectionMethod = 't-test', classifier = "GLM",
                                     nRepeats = 1, nFolds = 3, nCores = 1,
                                     multiViewMethod = "merge")

interfacePCA

This broke when variables had zero variability. I need to add some filtering here in a sensible way.
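A sketch of the filtering that could be added before the PCA step; object names are illustrative:

```r
# prcomp(..., scale. = TRUE) fails on constant columns
# ("cannot rescale a constant/zero column"), so drop them first.
measurements <- data.frame(a = rnorm(10), b = rep(5, 10), c = rnorm(10))
variances <- vapply(measurements, var, numeric(1))
filtered <- measurements[, variances > 0, drop = FALSE]
colnames(filtered)  # the constant column "b" is removed
```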

Train function

It would be good to have a train() function to complement predict() for situations where people don't need to run crossValidate.

Effectively this would take most of the same input as crossValidate() and then do a runTest.

One key distinction from crossValidate would be that, for multiViewMethod, it only fits one model. So by default this should be the full model unless they specify an assayCombination.

Auto detect for ROCplot colours

It would be great to auto select the colours for ROCplot so the default wasn't Classifier Name.

Perhaps if comparison is not provided, and there is no characteristicsLabel, it could automatically paste the non-unique Characteristics values?
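A hypothetical sketch of the auto-detection, assuming the characteristics of each result are available as a data frame (none of these names are ClassifyR's actual internals):

```r
# Use the first characteristic whose values vary between results as the
# grouping variable, instead of always defaulting to "Classifier Name".
autoComparison <- function(characteristics) {
  varying <- vapply(characteristics,
                    function(values) length(unique(values)) > 1, logical(1))
  names(characteristics)[varying][1]  # NA if every characteristic is constant
}

chars <- data.frame("Classifier Name" = c("SVM", "SVM"),
                    "Selection Name" = c("t-test", "CoxPH"),
                    check.names = FALSE)
autoComparison(chars)  # "Selection Name"
```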

Implement Two-stage Classifier

Recently, a two-stage classifier which combines clinical data with omics data was shown to reduce classification error rates greatly. Implement the idea in a generalisable way that works on mixed clinical data types and with user-specified feature selection and classification.

AUC Only After Merging All Folds

AUC can be calculated by merging all predictions from all iterations, as is done currently, but for large sample-size data sets, it would be slightly better to calculate the AUC at each fold separately and then average the values. Also, move the AUC performance calculation code from ROCplot.R into calcPerformance.R.
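A sketch of the per-fold calculation using a rank-based AUC (the fold assignment and data are illustrative):

```r
# Rank-based (Mann-Whitney) AUC for one set of scores and 0/1 labels.
aucValue <- function(scores, labels) {
  pos <- scores[labels == 1]
  neg <- scores[labels == 0]
  mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
}

set.seed(1)
labels <- rep(c(0, 1), each = 50)
scores <- labels + rnorm(100)
folds <- sample(rep(1:5, length.out = 100))

# Compute AUC within each fold, then average, instead of pooling first.
foldAUCs <- sapply(1:5, function(fold)
  aucValue(scores[folds == fold], labels[folds == fold]))
mean(foldAUCs)
```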

Train function doesn't allow selectionMethod = "none" with coxNet.

Running these lines of code spits out the following error:
Error in .doSelection(measurementsUse, outcomeTrain, CrossValParams(), : trying to get slot "tuneParams" from an object of a basic class ("NULL") with no slots

classifier = "CoxNet"
selectionMethod = "none"

output = train(
  x = DataFrame(measurements),
  outcome = outcome,
  classifier = classifier,
  selectionMethod = selectionMethod
)
