mayer79 / missranger

R package "missRanger" for fast imputation of missing values by random forests.

Home Page: https://mayer79.github.io/missRanger/

License: GNU General Public License v2.0

R 94.76% TeX 5.24%

r imputation random-forest rstats machine-learning missing-values

missranger's Introduction

{missRanger}

Overview

{missRanger} uses the {ranger} package to do fast missing value imputation by chained random forests. As such, it serves as an alternative implementation of the beautiful 'MissForest' algorithm, see vignette.

The main function missRanger() offers the option to combine random forest imputation with predictive mean matching. First, this avoids generating values not present in the original data (like a value 0.3334 in a 0-1 coded variable). Second, this step tends to raise the variance in the resulting conditional distributions to a realistic level, which is crucial for applying multiple imputation frameworks.

Installation

# From CRAN
install.packages("missRanger")

# Development version
devtools::install_github("mayer79/missRanger")

Usage

We first generate a data set with about 10% missing values in each column. Then those gaps are filled by missRanger(). In the end, the resulting data frame is displayed.

library(missRanger)
 
# Generate data with missing values in all columns
irisWithNA <- generateNA(iris, seed = 347)
 
# Impute missing values
irisImputed <- missRanger(irisWithNA, pmm.k = 3, num.trees = 100)
 
# Check results
head(irisImputed)
head(irisWithNA)
head(iris)

# Replace random forest by extremely randomized trees
irisImputed_et <- missRanger(
  irisWithNA, 
  pmm.k = 3, 
  splitrule = "extratrees", 
  num.trees = 100
)

# Using the pipe...
iris |> 
  generateNA() |> 
  missRanger(pmm.k = 5, verbose = 0) |> 
  head()
  
# More infos via `data_only = FALSE`
imp <- missRanger(irisWithNA, pmm.k = 3, data_only = FALSE, seed = 3)
summary(imp)

# missRanger object. Extract imputed data via $data
# - best iteration: 3 
# - best average OOB imputation error: 0.02058243 
# 
# Sequence of OOB prediction errors:
# 
#      Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
# [1,]   1.00000000  1.03868004  0.267209559 0.103679645 0.08148148
# [2,]   0.02948771  0.05997235  0.005676231 0.007813704 0.00000000
# [3,]   0.02709505  0.06268752  0.004921649 0.008207934 0.00000000
# [4,]   0.02673459  0.06504868  0.005183209 0.008761418 0.00000000
# 
# Corresponding means:
# [1] 0.49821014 0.02059000 0.02058243 0.02114558
# 
# First rows of imputed data:
# 
#   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1          5.1         3.5          1.4         0.2  setosa
# 2          4.9         3.0          1.4         0.2  setosa
# 3          4.7         3.2          1.6         0.2  setosa

Check out the vignettes for more info.

missranger's Issues

How to save the trained Random Forest model and use it to impute new data set?

Hi, I would appreciate it if you could explain how to use missRanger to build a random forest model on a training data set and then apply it to a test data set. Thank you very much!

Why is the "out of bag" error over 1?

Hi,

Thanks for providing such a nice package for imputation.

I have a question about the "OOB" error. When conducting multiple imputations on a data frame (nrow = 4,320, ncol = 596), I got many "OOB" errors over 1 (max = 1.04). The overall missing rate is 64.61%, but the maximum missing rate per column is 10.69%. Could someone help me figure out why the OOB error is over 1?

Thanks for your help.
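One possible reading of errors above 1, sketched under an assumption: if the reported error for a numeric variable is the out-of-bag mean squared error divided by the variance of the observed values (this normalization is assumed here, not confirmed by the thread), then any prediction that does worse than a constant guess pushes the ratio above 1. The helper name `normalized_mse` is hypothetical.

```r
# Hypothetical helper: mean squared error relative to the variance of y.
# Assumption: the reported OOB error of a numeric variable behaves like this ratio.
normalized_mse <- function(y, pred) {
  mean((y - pred)^2) / var(y)
}

y <- c(1, 2, 3, 4, 5)

# Perfect predictions give a ratio of zero.
perfect <- normalized_mse(y, y)

# Predictions that run against the data do worse than a constant guess,
# so the ratio exceeds 1.
reversed <- normalized_mse(y, rev(y))

print(c(perfect = perfect, reversed = reversed))
```

Under that reading, a value such as 1.04 would simply mean that the covariates carry essentially no information about the variable in question.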

Multiple imputation via bootstrapping rather than predictive mean matching

missRanger contains the pmm.k argument to allow users to add more variability to their imputed values and obtain imputed values drawn from observed values. However, variability is already built into the random forest model by (a) generating bootstrapped "out of bag" data, (b) drawing features at random, and (c) relying on many trees. Of course, you might underestimate variability and choose to derive multiple imputed datasets. However, the documentation advises the use of predictive mean matching to add variability across each imputed dataset.

It would be worthwhile acknowledging in the documentation perhaps that this, too, will understate variability because it assumes the donor pool is random, when really all of your data are random. An approach that (a) eliminates the need to use predictive mean matching (and tune the number of nearest neighbors selected), (b) is more theoretically motivated, and (c) is more robust to "false convergence" by adding further variability to the initialization values in the chained equation is to bootstrap your entire dataset (sample rows with replacement) for each imputation you want and then run missRanger() on the bootstrapped datasets. As far as I can tell, this isn't any more computationally complex and is strictly better.

The one thing to be aware of then is that your random number seed needs to be declared prior to bootstrapping the data. You can then use the same random seed for each full imputation via missRanger() since even the same seed will produce different values at that step since your data already are random.
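The bootstrap proposal above could be sketched roughly as follows. This is a minimal illustration, not a recommended workflow: the helper `bootstrap_rows`, the number of imputations `m`, and the toy missingness are all assumptions made for the example, and the call to missRanger() only runs if the package is installed.

```r
# Hypothetical helper: one bootstrap replicate (rows sampled with replacement).
bootstrap_rows <- function(df) {
  df[sample(nrow(df), replace = TRUE), , drop = FALSE]
}

set.seed(2024)  # declare the seed once, before any bootstrapping
m <- 5          # number of imputed data sets (assumed value)

# Toy data: knock out a few values in one column.
iris_na <- iris
iris_na[sample(nrow(iris_na), 15), "Sepal.Length"] <- NA

# One bootstrapped data set per desired imputation.
boots <- lapply(seq_len(m), function(i) bootstrap_rows(iris_na))

# Impute each replicate; the same seed can be reused because the inputs differ.
if (requireNamespace("missRanger", quietly = TRUE)) {
  imputed <- lapply(boots, missRanger::missRanger,
                    num.trees = 50, verbose = 0, seed = 1)
}
```

Each element of `imputed` would then be analysed separately and the results pooled, as in any multiple imputation workflow.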

Is there any parallelization option?

Hi, I was curious whether there is any parallelization option, since the trees in a random forest can be built independently, unlike in boosting. Is there anything in this library that would prevent missRanger from being parallelized? I'm asking because the missForest package has such an option, along with a code example, and I'm looking for a way to speed up my imputation.

Thank you for this package

Consistent as.character(formula)==3L error

Hello,

In the last week, missRanger has been giving an error that "length(formula <- as.character(formula)) == 3L is not TRUE". This error is also coming up when I ran the vignette code to see if it was an issue with my data, but this error comes up every time:

irisImputed <- missRanger(irisWithNA, pmm.k = 3, num.trees = 100, verbose = 0)
irisImputed <- missRanger(irisWithNA, formula= .~.-Species, pmm.k = 3, num.trees = 100, verbose = 0)

Adding out-of-bag errors and convergence monitoring

Two enhancements would be very useful:

Add an argument to return the out-of-bag (OOB) prediction error on a variable-by-variable basis, à la missForest. If the argument is TRUE, the function could return a list with the data in one element and the errors in the others, preferably variable-wise. Right now only the average is returned.
Some way of monitoring convergence, like mice's plots of the chained values.

Question on PMM

Hi Michael, my question is regarding PMM.

e.g. a dataset with 10000 variables with different levels of missingness
Is there a potential for bias if PMM is carried out after the model for one variable?

Since the kNN search will only be among that variable's values and will not account for all the variables.
If all variables were accounted for in the distance, the neighbours would be different, I suppose...

And not related to missRanger: I see you've started working with LightGBM. Have you tried imputations with it?

Best
Thierry

Cannot impute categorical variables

I am trying to impute a dataset that contains categorical (factor) and numerical variables and that has about 10% missing values in each variable. I have been able to impute my dataset with missForest, but with missRanger I get the error "Error: is.vector(x) is not TRUE" when it comes to imputing my factor variable.

Is there a way to impute factors with missRanger?

Allow out-of-sample application

Hi,

I'm excited about the new keep_forests option. I was hoping to use it to train imputation forests on a training set and then use those models to impute on a test set. However, when I try, I get an error that I am missing data in other columns in the test set and therefore can't predict out the imputations for a given variable. Is there any way around that?

Note that I think in the documentation it says: "Only relevant when data_only = TRUE (and when forests are grown)." I think you meant FALSE.

Thanks!

Possibility to access Ranger object for Shap values

The project could be extended to cater for the production of Shap values and other metrics that the underlying package accounts for.
Is there a way to accomplish that in the current version already?

enhancement (something similar to randomForest::na.roughfix)

Hi Michael,

I don't know what's your target audience/users, but currently in missRanger::missRanger (line 67):

you're using fit <- ranger::ranger(stats::reformulate(completed, response = v) where completed requires columns to have no missing data. For some datasets (e.g. genomic), this behaviour can create huge bias by drastically reducing the number of variables available for training.

I suggest an enhancement similar to randomForest::na.roughfix, where an additional argument would give the user the possibility of quickly filling missing values in predictor/training set columns.

Also discussed here: https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#missing1

Missing value replacement for the training set
Random forests has two ways of replacing missing values. The first way is fast. If the mth variable is not categorical, the method computes the median of all values of this variable in class j, then it uses this value to replace all missing values of the mth variable in class j. If the mth variable is categorical, the replacement is the most frequent non-missing value in class j. These replacement values are called fills.
The second way of replacing missing values is computationally more expensive but has given better performance than the first, even with large amounts of missing data. It replaces missing values only in the training set. It begins by doing a rough and inaccurate filling in of the missing values. Then it does a forest run and computes proximities.
If x(m,n) is a missing continuous value, estimate its fill as an average over the non-missing values of the mth variables weighted by the proximities between the nth case and the non-missing value case. If it is a missing categorical variable, replace it by the most frequent non-missing value where frequency is weighted by proximity.
Now iterate-construct a forest again using these newly filled in values, find new fills and iterate again. Our experience is that 4-6 iterations are enough.

Cheers
Thierry

Undefined Columns Selected Error if Strange Variable Names Present

If some strange omics identifiers are some of the column names, the imputation fails.

> colnames(iris)[2] <- "IGHV3-43D;IGHV3-9"
> missRanger(iris)
Missing value imputation by random forests
Error in `[.data.frame`(data, , relevantVars[[1L]], drop = FALSE) : 
  undefined columns selected

Can you make it more bioinformatics-friendly?

Replace `cat` with `message`

Hi Mayer,

This is a great package, Thanks! I am attempting to use missRanger function within another, but some cat output seems to clutter the output and it is hard to get rid of them. Would message be a better option?

Random Forest Missing Data Algorithms

interesting paper you might like...
https://arxiv.org/abs/1701.05305

How can I change hyperparameters as in ranger?

Hi!

In ranger, I can tweak these arguments: num.trees, mtry, min.node.size, splitrule, and num.random.splits. How can I do so in missRanger?

Thank you very much!
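As the usage section shows with num.trees and splitrule, additional arguments given to missRanger() are handed on to ranger. A minimal sketch under that assumption follows; the specific values are illustrative only, and the imputation itself runs only if the package is installed. mtry is left out on purpose, since a fixed value may exceed the number of available covariates in some iterations.

```r
# Tuning arguments destined for ranger (values chosen only for illustration).
ranger_args <- list(
  num.trees = 200,
  min.node.size = 10,
  splitrule = "extratrees",
  num.random.splits = 2
)

if (requireNamespace("missRanger", quietly = TRUE)) {
  iris_na <- missRanger::generateNA(iris, seed = 1)
  # Equivalent to missRanger(iris_na, verbose = 0, num.trees = 200, ...).
  imp <- do.call(
    missRanger::missRanger,
    c(list(iris_na, verbose = 0), ranger_args)
  )
  head(imp)
}
```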

missRanger as.character(formula) might fail with long formulas

Hi, thank you for maintaining this great and extremely useful package!

In missRanger in the stopifnot line 8 you use formula <- as.character(formula) to later construct the relevantVars out of it in line 19. I just noticed that this might fail, since from ?as.character we learn,

as.character breaks lines in language objects at 500 characters, and inserts newlines. Prior to 2.15.0 lines were truncated.

Example:

as.character(as.formula(paste(paste(rep('foo', 85), collapse=' + '), '~ .')))[[2]]

I suggest using all.vars(formula) or something similar instead (rather than fixing it with gsub('\\n ', '', x2)).

Cheers!
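The all.vars() suggestion can be illustrated in base R. The sketch below builds a long formula with hypothetical variable names x1 to x85 and extracts the names from each side without going through as.character(), so line breaks in the deparsed text cannot interfere.

```r
# Build a formula with 85 distinct (made-up) variables on the left-hand side.
lhs <- paste(paste0("x", 1:85), collapse = " + ")
f <- as.formula(paste(lhs, "~ ."))

# For a two-sided formula, f[[2]] is the left side and f[[3]] the right side.
lhs_vars <- all.vars(f[[2]])
rhs_vars <- all.vars(f[[3]])

length(lhs_vars)   # number of variables found on the left-hand side
rhs_vars           # the dot placeholder on the right-hand side
```

Because all.vars() works on the parsed expression rather than on its text, the result does not depend on the length of the formula.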

Parallel & progress bars

Hi there! I was wondering if there is any way to adapt your algorithm to work in parallel, similarly to missForest. Secondly, I was wondering if there is a way to have some kind of progress bar per iteration (showing a percentage rather than the dots). Thanks for your work on this package :)

Cheers,
Joanna

Allow 'mtry'

Would you be interested in a patch to allow users to adjust mtry? It's a major tuning parameter for random forests, so I think it would help. What I have done is to add a mtry_rule argument, which is a function taking a vector of variable names as input and returning a value of mtry, called with union(v, completed). The current rule is roughly

function(x) floor(sqrt(length(x) - 1))

and I've tried others such as

function(x) min(5, length(x)-1)
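To make the two rules concrete, here is a small base R sketch that evaluates them on a vector of five variable names (the response plus four covariates). The function names `mtry_sqrt` and `mtry_cap5` are made up for this illustration and are not part of the proposed patch.

```r
# The two rules from the proposal, given hypothetical names.
mtry_sqrt <- function(x) floor(sqrt(length(x) - 1))
mtry_cap5 <- function(x) min(5, length(x) - 1)

# The response plus the completed covariates, as in union(v, completed).
vars <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species")

mtry_sqrt(vars)  # floor(sqrt(4)) = 2
mtry_cap5(vars)  # min(5, 4) = 4
```

Subtracting 1 removes the response from the count, so both rules return a value no larger than the number of covariates.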

Current use of OOB prediction in pmm

I was just having a cursory look at the missRanger algorithm, and I noticed this:

    data[v.na, v] <- if (pmm.k) pmm(xtrain = fit$predictions, 
                                    xtest = pred, 
                                    ytrain = data[[v]][!v.na], 
                                    k = pmm.k) else pred

The value of fit$predictions is the out-of-bag prediction, not the prediction of the full forest. Is this the intended behaviour? The prediction of the random forest (in its entirety) can only be obtained via predict(fit, data = data[!v.na, union(v, completed)])$predictions. Unfortunately, prediction is somewhat slow in ranger, so if you have good reasons to use the out-of-bag predictions rather than the fitted forest predictions, it would be great to hear them.

Initial matrix prior to iterating

Hello-

Love the package, I was curious if there is a way to initialize the matrix to something other than predictive mean matching? For example, instead of using predictive mean matching, say using zero imputation, min imputation or an alternative method. I am working with left truncated data where missing data is likely very close to zero, so I wondered if there was a way to do this?

Question on missRanger and BRMS

Hi,

Thank you for a brilliant package. I'm using missRanger to impute, and then apply BRMS to the imputed dataset. BRMS describes how to use the mice package, but missRanger imputed data comes out quite different.

Ideally I would have imputed the data, pooled the data, run my models, run model comparisons. But I cannot then pool using mice, it doesn't work. So instead I run multiple models on imputed data like this:

models_imputed <- brm_multiple(formula = score ~ 1 + cs(group), data = imputed, family = acat("cloglog"), combine=TRUE, chains=1)
But this is pretty clunky, and if I try to do a LOO on my models (I have 5) I get the error:
Using only the first imputed data set. Please interpret the results with caution until a more principled approach has been implemented.

This isn't an issue with missRanger as such, more that I'm caught in the space between missRanger and BRMS and am not sure how to get them to work together...hoping someone might have advice!

Thanks

More granular control over which cells get imputed

I would love to see a feature whereby I can feed a logical matrix of the same dimensions as the underlying data.frame into the function, which controls which cells in the data.frame get imputed, and which do not.

I currently have a data.frame which has two types of NAs: (i) data which I need to impute, and (ii) data which I know should never exist.

This situation arises when trying to impute unbalanced panel data (e.g. annual income of a population of individuals). Since I reshape this data to be "wide" (one row per person), I end up with a number of columns (e.g. income_2010, income_2011, ... etc). This is essential to capture time dynamics (i.e. my income this year and next year are strongly correlated).

So for a person who died in 2005, I do not wish to impute income_2006, income_2007, etc. But for someone whose income is missing during their lifetime, I would like to impute it.

All the best - and thanks for a great package!
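Until such a feature exists, one possible workaround is to impute everything and then put NA back into the cells that should never hold a value. The base R sketch below shows only the restoring step; the helper `restore_structural_na`, the toy panel, and the filled-in numbers are assumptions, and `completed` stands in for the output of missRanger() on the wide data.

```r
# Hypothetical helper: reset cells flagged in a logical matrix to NA.
restore_structural_na <- function(imputed, never_impute) {
  stopifnot(identical(dim(imputed), dim(never_impute)))
  imputed[never_impute] <- NA
  imputed
}

# Toy wide panel: person 3 died in 2005, so income_2006 must stay empty.
panel <- data.frame(
  income_2004 = c(30, NA, 41),
  income_2005 = c(31, 28, NA),
  income_2006 = c(NA, 29, NA)
)
never_impute <- matrix(FALSE, nrow = nrow(panel), ncol = ncol(panel))
never_impute[3, 3] <- TRUE

# Placeholder for the imputed data (made-up values, not real output).
completed <- data.frame(
  income_2004 = c(30, 27, 41),
  income_2005 = c(31, 28, 42),
  income_2006 = c(32, 29, 43)
)

result <- restore_structural_na(completed, never_impute)
print(result)
```

A limitation of this workaround is that the structural cells are still filled in during imputation and act as predictors for other columns, which a built-in mask could avoid.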

allow syntactically wrong column names like "bad name"

See title. Will wait for release 0.12 of ranger as its x/y interface will change.

add unit tests

Before the next CRAN release, we should add unit tests for all exported functions.

Add OOB accuracy to screen

At the moment, the user has no idea of the OOB accuracy of the forests. Maybe we can add a verbose = 1 option like this:

0: print nothing
1: current behaviour (one line per iteration, one dot per variable)
2: instead of printing dots as now, would print OOB accuracy. Might be a bit strange if both numeric and categorical variables are involved.

Stack Overflow Error

I have a gene expression matrix with only two missing values. It causes a stack overflow error, so I did mean imputation instead.

> dim(RNAarrays)
  385 24174
> table(is.na(RNAarrays))
    FALSE    TRUE 
  9306988       2 
> RNAarrays <- missRanger(as.data.frame(RNAarrays))
  Missing value imputation by random forests
  Error: protect(): protection stack overflow

Return OOB accuracy

Hi,
Thanks for the work on this package. I can't open a branch to submit a PR, but what do you think of the changes here:
https://gist.github.com/markgrujic/a46a4466b618164e77e0abda14d97590

Note a new argument called returnOOB and the associated code from line 80 that returns a list of the imputed data and OOB accuracy if required.
Cheers

Interface to specify variables to use/impute/ignore.

missRanger uses all factors and numeric variables in the fitting process. This is annoying if there are id variables etc. that we want to ignore during imputation. Similarly, there might be situations where there is a fixed set of variables X used to fill missing values and a set of variables Y that should be imputed. What is the neatest way to pass this info to missRanger?

Suggestion:

impute = NULL (or character vector): Variables to impute. If NULL, use all.
impute_by = NULL (or character vector): Variables used to impute "impute". If NULL, use all.
impute_ignore = NULL (or character vector): Variables not to impute.
impute_by_ignore = NULL (or character vector): Variables not used to impute.

Any input is very welcome.

Matrix columns in data frame generate error in missRanger

Hi -

I've been a big fan/user of this package for a while, so I thought I would contribute a bug report on an issue I just encountered. Essentially, if one of the columns in a data frame is not a vector, the code will error without explanation because the column names don't transfer into a matrix of missing values (dataNA). It appears to happen at this line of code:

dataNA <- is.na(data[, visitSeq, drop = FALSE])

Some reproducible code is below:

# reproducible example

library(missRanger)

irisWithNA <- generateNA(iris, seed = 34)

# scale some variables

irisWithNA$Sepal.Length <- scale(irisWithNA$Sepal.Length)

class(irisWithNA$Sepal.Length)

try(missRanger(irisWithNA, pmm.k = 3, num.trees = 100))

# convert back to vector

irisWithNA$Sepal.Length <- c(scale(irisWithNA$Sepal.Length))

missRanger(irisWithNA, pmm.k = 3, num.trees = 100)

It's not really a bug per se, just either some missing pre-processing or a need for an informative error message. I feel like it would be worthwhile to address given how common the scale function is in R.
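The missing pre-processing step could look like the base R sketch below, which turns one-column matrix columns (such as those produced by scale()) back into plain vectors before imputation. The helper name `as_plain_columns` is hypothetical and not part of missRanger.

```r
# Hypothetical helper: convert one-column matrix columns to plain vectors.
as_plain_columns <- function(df) {
  for (nm in names(df)) {
    col <- df[[nm]]
    if (is.matrix(col) && ncol(col) == 1L) {
      df[[nm]] <- as.vector(col)  # drops dim and scaling attributes
    }
  }
  df
}

x <- iris
x$Sepal.Length <- scale(x$Sepal.Length)  # now a 150 x 1 matrix column
was_matrix <- is.matrix(x$Sepal.Length)

x <- as_plain_columns(x)
is_matrix_now <- is.matrix(x$Sepal.Length)

print(c(before = was_matrix, after = is_matrix_now))
```

Note that as.vector() discards the centering and scaling attributes, so they should be stored separately if the values need to be transformed back later.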

normal behaviour of `missRanger` compared with `randomForestSRC`

Hi Michael,

I gave missRanger a try, using genomic dataset with lots of missing genotypes (RADseq).
Could you tell me why randomForestSRC is able to impute the data below, but not missRanger ?

INDIVIDUALS	GENOTYPE
1	001001
2	003003
3	001001
4	003003
5	003003
6	003001
7	003003
8	NA
9	001001
10	001001

I know imputing this would be unreliable, but apart from this, what's the solution if a complete dataset is required for an analysis ?

Best regards
Thierry

Inheriting Global Seed

Is there a way to have the function inherit the global seed value? Otherwise you have to specify the seed at each function call in order to make the work reproducible.

Best,
Jamey

question on missRanger

Hi,
very nice package. I love ranger package as well so very good idea to use it for imputation!
I have 4 questions actually:

1. I was just wondering if it's possible to impute data based on some subset of rows, like in cross-validation settings or, in my case, in a time-dynamic setup.

   E.g. I have time series data and I would like to impute values using only past data. Is there a better way than calling missRanger repeatedly at each time point, subsetting on the past?

2. Even in that case, should I impute the next day using the raw past data or using the previously completed data?
3. Similarly, if I have a train and a test dataset, is there a way to apply the rules of the train dataset imputation to the test one without rerunning the algorithm? (Can missRanger return the imputation model?)
4. Do you have recommendations about when to use the extratrees splitrule? Is it better, and if so, in which cases?

Thanks!

Question

Quick question Michael...

Scenario where you have more than 1 response variable missing:

e.g. with the iris dataset
let say Sepal.Length and Sepal.Width are missing
we know that both of these values are correlated together with Species.

Your implementation imputes by column. Is the correlation between columns still accounted for in the model? Because we don't want imputed values that, taken together after imputation, don't "fit" the species...

Best,
Thierry

How to test the accuracy of predictions?

I am rather new to random forests and especially imputations and I would like to know how can I get an estimate of the accuracy of these predictions?

For example, if I have a matrix with species in columns and abundance in rows and use generateNA to create NA values for imputation, how can I test the accuracy of the predicted values against the actual observed values?

For example, I have the following dataset and want to test how well the imputation works against the abundance of these species over these selected two years, relative to the actual values:

#data
# A tibble: 12 x 7
   eventDate   year Hirundorustica Himantopushimantopus Gallinulachloropus Fulicaatra Spilopeliachinensis
   <date>     <dbl>          <int>                <int>              <int>      <int>               <int>
 1 2019-01-01  2019         375087               275213             337709    1638522               81054
 2 2019-02-01  2019         245500               174385             230240     864141               72817
 3 2019-03-01  2019         478287               169552             207516     509113               59389
 4 2019-04-01  2019        1149118               146255             162036     371454               58119
 5 2019-05-01  2019        1777995                84937             132554     290331               43462
 6 2019-06-01  2019         674044                52308             101186     255249               24479
 7 2019-07-01  2019        1114053                75779             107368     377148               23558
 8 2019-08-01  2019        2091425                81571             133904     535402               31321
 9 2019-09-01  2019        1834696               105622             141551     659778               46775
10 2019-10-01  2019         676342               111289             174135     737695               76354
11 2019-11-01  2019         322766               143620             302165     869143               63237
12 2019-12-01  2019         359126               193387             281926    1299738               66995

#Generate NAs
mimNA[, -c(1, 2)]<- generateNA(mimNA[, -c(1, 2)], seed=5)

#then create imputations
mRan <- missRanger(mimNA[, -c(1, 2)], pmm.k=3, num.trees = 100)
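One way to approach this is to compare imputed and true values only in the cells that were artificially masked. The sketch below is a minimal illustration for numeric columns: the helper `masked_rmse` is hypothetical, the tiny data frames are made up, and the missRanger part runs only if the package is installed.

```r
# Hypothetical helper: RMSE per numeric column, restricted to masked cells.
masked_rmse <- function(original, with_na, imputed) {
  vapply(names(original), function(v) {
    masked <- is.na(with_na[[v]]) & !is.na(original[[v]])
    sqrt(mean((imputed[[v]][masked] - original[[v]][masked])^2))
  }, numeric(1))
}

# Tiny made-up example: two cells were masked, one is recovered exactly.
truth  <- data.frame(a = c(1, 2, 3, 4))
holes  <- data.frame(a = c(1, NA, 3, NA))
filled <- data.frame(a = c(1, 2, 3, 6))
toy_rmse <- masked_rmse(truth, holes, filled)

# The same idea with missRanger on the numeric iris columns.
if (requireNamespace("missRanger", quietly = TRUE)) {
  original <- iris[1:4]
  with_na  <- missRanger::generateNA(original, p = 0.1, seed = 5)
  imputed  <- missRanger::missRanger(with_na, pmm.k = 3,
                                     num.trees = 100, verbose = 0)
  print(masked_rmse(original, with_na, imputed))
}
```

Repeating this over several seeds gives a sense of how stable the error estimate is. With only 12 rows, as in the data shown, the estimate will be very noisy.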