nhejazi / biotmle Goto Github PK

:package: :microscope: R/biotmle: Targeted Learning with Moderated Statistics for Biomarker Discovery

Home Page: https://code.nimahejazi.org/biotmle/

License: Other

R 84.96% Makefile 1.95% TeX 13.10%

bioinformatics biostatistics bioconductor statistics machine-learning causal-inference r bioconductor-packages bioconductor-package targeted-learning

biotmle's Introduction

hello, i'm Nima

I'm an academic (bio)statistician working at the interface of causal inference, machine learning, and non- and semi-parametric statistics. I'm passionate about building open-source software tools to improve the accessibility of modern, model-agnostic and assumption-lean methods for statistical inference and causal machine learning, and I'm especially excited by the applications of statistics to the biomedical and public health sciences.

Are you looking for open source software for targeted causal machine learning? Maybe you should check out the tlverse project and browse our free open-source handbook!

biotmle's People

Contributors

Stargazers

Watchers

Forkers

guhjy masedki

biotmle's Issues

BioC 3.6 Package Maintenance

After this package was accepted in Bioconductor, the copy made available through their repository was switched over from using git to svn for version control.

As development is taking place on the version hosted on GitHub, it's important to maintain some form of synchronicity between the git and svn histories. Some resources on this:

Bioconductor guide to GitHub+SVN

revdep check on wesanderson

👋 @nhejazi
Just wanted to check if you're having any build issues with this package. I find that wesanderson fails on a revdep check here. But I suspect it was an issue earlier from using Zissou instead of Zissou1.

I see that this has been replaced with:

biotmle/R/plots.R

Line 107 in 9349978

pal1 <- wesanderson::wes_palette("Royal2", 100, type = "continuous")

If this is correct, I'll ignore the failed check and proceed. Please advise.

Current error is:

   > heatmap_ic(x = limmaTMLEout, design = design, FDRcutoff = 0.05,
    +            top = 15)
    Error in wes_palette("Zissou", 100, type = "continuous") : 
      Palette not found.
    Calls: heatmap_ic -> wes_palette
    Execution halted

Thanks for your help 🙏

`SummarizedExperiment` row names don't propagate to `modtest_ic()` output

biotmle/R/eif_moderated.R

Lines 73 to 88 in 7d720af

 tt <- limma::topTable( 

 fit = fit, 

 coef = 1, 

 number = Inf, 

 adjust.method = adjust, 

 sort.by = "none", 

 ... 

 ) %>% 

 tibble::rownames_to_column(var = "ID") %>% 

 tibble::as_tibble() %>% 

 mutate( 

 # remove log-fold change and overwrite average expression with ATE 

 logFC = NULL, 

 AveExpr = biotmle@ateOut, 

 var_bayes = fit$s2.post / n_obs, 

 )

The row names from the original SummarizedExperiment object fed to to biomarkertmle() don't appear as row names in the topTable slot of the modtest_ic() output. I think that this can be fixed by removing line 81 from above, and adding ID = rownames(biotmle) to the mutate() call.

Allow specification of functional forms for g and Q fits

The tmle package allows for the functional form of the nuisance parameters g and Q to be specified via the optional arguments gform and Qform, respectively. These arguments simply take in a character that is then translated to a formula for the glm function (via as.formula) --- e.g., gform = "A ~ W1 + W2". Since the machinery used for point estimation in this package relies on the tmle package (though the variance estimator is different here), an option should be added to allow gform and Qform (and, indeed, other arbitrary arguments) to be passed to the call to tmle in biomarkertmle_exposure. This would be most simply done via the addition of a ... argument.

Bioc Check Failure

Current (v3.10) Bioconductor package checks are failing on all platforms. See https://bioconductor.org/checkResults/devel/bioc-LATEST/biotmle. This needs to be resolved as soon as possible in order to avoid package deprecation in the next release (v3.11).

Parallelization via future

The future package provides facilities for registering backends for parallelization across systems with diverse types of schedulers (e.g., SLURM vs. SNOW, etc.). To make the parallelization machinery within this package easier to use, it is likely worth moving away from the use of foreach and over to future.

This directly conflicts the proposal in #18. Only one of these may be implemented.

Error in superheat:

Hello, I'm following your excellent biomarker pipeline and ran across an error with the heatmap/dendrogram command (see below).

heatmap_ic(x = limmaTMLEout,
+            row.dendrogram = TRUE,
+            clustering.method = "hierarchical",
+            design = design, FDRcutoff = 1.0, top = 10)
**Error in superheat::superheat(as.matrix(biomarkerTMLEout_top), row.dendrogram = TRUE,  : 
  formal argument "row.dendrogram" matched by multiple actual arguments**

I installed the superheat package so I'm not sure where the error is coming from?

Testing IC estimates

use only intercept in design matrix; otherwise, unclear what the t-test is actually doing...
using more than just the intercept term after applying the IC transform would result in a test that is not necessarily well-defined

preliminary R package

devtools::install_github("nhejazi/biotmle", subdir = "pkg", ref = "develop")

@ahubb40 @wilsoncai1992 A preliminary version of the TMLE-Limma R package is now available and can be installed using devtools with the command listed above.

example Illumina 2007 data set

look at .RData file in BenzeneIllumina2007 / Illumina Data 2007 / R subdirectory on Dropbox
note that the data set includes repeated measures, which should be dealt with appropriately
use this data set for testing the structure and current code associated with this package

appveyor build exits with code 5

Using the latest version of a GitHub package that isn't owned by my account causes a failure -- in particular, it appears that I am restricted to the CRAN version of Karthik's wesanderson color palette.

See the build log on Appveyor -- specifically, v1.0.101 -- for details. Package installation fails with exit code 5. It's unclear what this means or why this is happening, given that the Travis-CI build is successful.

CIs based on tail bounds

We should implement several alternative methods for inference, not drawn directly from the normal approximation to the asymptotic distribution of influence functions. At least two classes of options are available

Use of a logistic distribution (or other heavy-tailed distribution) as reference instead of normal/t (post-hoc)
Bernstein, Bennett, or Hoeffding bounds for influence function-based estimators, as per https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2827893/

Switch heatmap to use superheat?

Based on comments from the in-progress Bioconductor review, the NMF package seems to have been broken by recent updates in one of its dependencies (namely, grid). Considering that superheat will soon be on CRAN, it would be nice to use this (better) tool rather than the heatmap implementation (that was, admittedly) thrown together using NMF.

Outcomes method for biomarkers

It appears that the idea of considering the impact of biomarkers on some kind of outcome measures was rather ill-conceived -- that is, with the current use of tmle::tmle for assessing the impact of biomarkers (on some kind of outcome via the ATE) requires that the expression measures associated with the biomarkers be discretized.

Generally speaking, there does not exist a reasonable way to discretize expression measures -- there is great disagreement in the genomics / high-dimensional biology literature -- and any estimates based on discretized biomarker expression values are likely to be un-meaningful scientifically.

TODO for Bioconductor

To-do list based on comments from the in-progress Bioconductor review:

switch illuminaData to use the SummarizedExperiment package and class
change biomarkertmle to accept input as objects of class SummarizedExperiment
re-write the biotmle class to simply exist as a subclass of SummarizedExperiment
explicitly check the "type" argument when first calling the biomarkertmle function
re-write heatmap function to make use of the SuperHeat package (see this issue)

TODO

Finish multilevel TMLE-LIMMA algorithm
collect data + try the method
include satisfactory data into package
completion/submission of statistical manuscript
submission to JOSS

What is the point of modtestOut?

The following error is produced by line 56 in modtest_ic:
cannot coerce class ""qr"" to a data.frame

The line in question attempts to change an MArrayLM object (produced by limma::eBayes to a data.frame via biotmle@modtestOut <- as.data.frame(fit).

Importantly, it is unclear what utility is gained by even saving the objected produced by limma::eBayes, as all inferential information is stored in the output of limma::topTable

Parallelization via BiocParallel

The original (and current) implementation of this technique used the "foreach" construct to perform parallel computation to assess the large numbers of biomarkers that are generally dealt with. While support for "foreach"-style computation should be kept intact, it would likely be worth adding a flag to the main functions, to allow the (optional) use of BiocParallel.

Simultaneous CIs

We should implement several approaches for generating simultaneous confidence intervals covering the ATE across the full set of specified probes. A few ideas:

Bonferroni CIs: \hat{ATE}g \pm Z{1 - 0.05/(2p)} SE_g, where SE_g is the standard error based on the pooled variance estimate
Multivariate Normal CIs: an approach based on the variance-covariance matrix of the EIF as implemented in https://github.com/tlverse/tmle3/blob/master/R/utils.R#L55-L57

BioC 3.8 - BiocCheck

A current run of BiocCheck::BiocCheck(".") returns a few interesting NOTEs, given below

"Avoid sapply(); use vapply() found in files:"
"Use accessors; don't access S4 class slots via '@' in examples/vignettes."

Summary:
ERROR count: 0
WARNING count: 3
NOTE count: 6
For detailed information about these checks, see the BiocCheck
vignette, available at
https://bioconductor.org/packages/3.7/bioc/vignettes/BiocCheck/inst/doc/BiocCheck.html#interpreting-bioccheck-output
$error
character(0)

$warning
[1] "y of x.y.z version should be even in release"                                               
[2] "Update R version dependency from 3.4 to 3.5."                                               
[3] "The following files are over 5MB in size: '.git/objects/pack/pack-fa09b8fab3b8aef3e71f2ecc0640cfe4994fb0cd.pack'"

$note
[1] "Avoid sapply(); use vapply() found in files:"                                               
[2] "Use accessors; don't access S4 class slots via '@' in examples/vignettes."                  
[3] "Consider adding runnable examples to the following man pages which document exported objects:"
[4] "Consider shorter lines; 1 lines (0%) are > 80 characters long."                             
[5] "Consider multiples of 4 spaces for line indents, 264 lines(18%) are not."                   
[6] "Cannot determine whether maintainer is subscribed to the bioc-devel\nmailing list (requires admin credentials).  Subscribe here:\nhttps://stat.ethz.ch/mailman/listinfo/bioc-devel"

Warning on 'Sequential evaluation is strongly discouraged. Proceed with caution.'?

Just spotted:

Warning in biomarkertmle(se = illuminaData[1:2, ], varInt = varInt_index,  :
  Sequential evaluation is strongly discouraged. 
 Proceed with caution.

while running rev-dep checks. What the reason and rational for that warning? As it's written now, it sounds like something will go wrong or one cannot trust the results.

biotmle/R/biomarkertmle-wrapper.R

Lines 152 to 155 in 32e00bc

 warning(paste( 

 "Sequential evaluation is strongly discouraged.", 

 "\n Proceed with caution." 

 ))

interpretation of ATE "fold change"

"Fold change" is a misleading term for the output generated by the biomarker TMLE procedure. While consistent with terminology employed by the limma package (and in the bioinformatics literature in general), the results produced by the procedure are more similar to a log-risk difference (really, a log-ATE) rather than a fold change (~log-OR)...

succinctly convey analysis results

The Bioconductor developer guidelines recommend implementing a method for show (in this case show.biotmle) to better convey the information presented by the output object. Currently, biotmle objects are sub-classed from the standard SummarizedExperiment class, so the standard call to print/show only displays the relevant SummarizedExperiment information, while the analysis results are hidden (though a tutorial on how to extract them is treated in the vignette). This should be corrected ASAP.

Robust variance-moderated testing of linear model parameters

The methodology implemented in this software package performs variance-moderated hypothesis testing of estimators of causal parameters (currently, only the ATE) in the nonparametric statistical model based on the values of the efficient influence function evaluated at each observation. Within the context of parametric models (e.g., the \beta coefficient of a linear model), it is possible to perform the same decomposition (since \hat{\beta} is an asymptotically linear estimator of \beta) and thus perform the same robust variance-moderated hypothesis test.

Technical Replicates Functionality

Include ability to specify technical replicates, similar to LIMMA

Limma Integration for RNA-seq

The call made to SuperLearner by the tmle function in the biomarkertmle function of this package fails when the matrix of potential biomarkers (Y) contains discrete values (counts are the standard in RNA-seq, and other NGS biotechnologies). As the substantive contribution of this methodology is implementing an influence curve transform (the tmleOut part of the object returned by biomarkertmle), it would be beneficial to integrate further with core functions of the limma package (specifically, the voom component) to allow for NGS data to be analyzed.

Specifically,

modtest_ic should be re-written to be a better wrapper around Limma's core functions: lmFit and topTable. In particular, it should be able to properly handle "voom"-transformed data.
For RNA-seq data, biomarkertmle should take as input (as Y) the voomObject$E matrix returned by limma::voom; this might most easily be done by creating a pre-processing function that would ensure that the input SummarizedExperiment object contains "voom"-transformed data in the assay slot.

Wishlist for Bioc3.8 Release

~~- [ ] address #36 by adding option/argument backend, giving the option to use either the tmle package (current implementation) or drtmle~~

re-think and implement color scheme for plots, moving away from wesanderson to ggsci (fits intended applications better) and use fall back on default color scheme for heatmaps (superheat uses viridis)
re-style code to follow tidyverse style conventions using styler
fix broken pkgdown site

biomarkertmle fails on data transposition

Hi, when I tried the script on the page https://code.nimahejazi.org/biotmle/articles/rnaseqProcessing.html the code is as follows:

se <- SummarizedExperiment(assays = list(counts = DataFrame(ngs_data)),
                           colData = DataFrame(design))
se
rnaseqTMLEout <- biomarkertmle(se = se,
                               varInt = 1,
                               type = "exposure",
                               ngscounts = TRUE,
                               parallel = TRUE,
                               family = "gaussian",
                               g_lib = c("SL.mean", "SL.glm", "SL.randomForest"),
                               Q_lib = c("SL.mean", "SL.glm", "SL.randomForest",
                                         "SL.nnet")
                              )

But it had a bug report:

Error in t.default(assay(se)) : argument is not a matrix
Calls: biomarkertmle -> as.data.frame -> t -> t.default
Execution halted

And I found the R script on Bioconductor is as follow. The code has been commented out.

se <- SummarizedExperiment(assays = list(counts = DataFrame(ngs_data)),
                           colData = DataFrame(design))
se

#  rnaseqTMLEout <- biomarkertmle(se = se,
#                                 varInt = 1,
#                                 type = "exposure",
#                                 ngscounts = TRUE,
#                                 parallel = TRUE,
#                                 family = "gaussian",
#                                 g_lib = c("SL.mean", "SL.glm", "SL.randomForest"),
#                                 Q_lib = c("SL.mean", "SL.glm", "SL.randomForest",
#                                           "SL.nnet")
#                                )
#  head(rnaseqTMLEout@tmleOut$E[, seq_len(6)])

data(rnaseqtmleOut)
head(rnaseqTMLEout@tmleOut$E[, seq_len(6)])

So, how can I fix the bug?

Toptable: Error: $ operator not defined for this S4 class

Below is another error I found executing the pipeline:

head(toptable(limmaTMLEout))
Error: $ operator not defined for this S4 class

Tibble problems in heatmap routine

Since a5482f5, the object in slot tmleOut was changed so as to be wrapped in as_tibble, which coerces it to multiple classes: tbl_df, tbl, data.frame. Upon invoking heatmap_ic, this results in the following warning:

Warning message:
In if (class(x@tmleOut) == "EList") { :
  the condition has length > 1 and only the first element will be used

Article Submission - JOSS

After implementing the combined Limma-TMLE variance shrinkage method, it would likely be worthwhile to submit a short article about the resultant R package to the Journal for Open Source Software (JOSS).

Using futures with BiocParallel

Henrik's relatively new package BiocParallel.FutureParam provides a way to combine the use of futures with the generic parallelization provided by #18 (the BiocParallel infrastructure). This presents an avenue to combine the previously conflicting proposals in #18 and #31.

parallelization is a dumpster fire

All of the logic in https://github.com/nhejazi/biotmle/blob/master/R/biotmle.R#L128-L152 should be re-written for simplification. Far too many options are set, and it's horribly unclear exactly what the parallelization scheme invoked is. In particular, there's not a good reason to support both of the Future and BiocParallel frameworks, and certainly no reason to invoke both. The constructor of biomarkertmle can be significantly simplified accordingly.

Workflow Article

Upon completion of the R package implementing this algorithm, it would be valuable to put together a workflow paper for the Bioconductor channel of F1000, demonstrating exactly what this approach adds to the existing suite of tools available for analyzing microarray and next generation sequencing data.

Backend package transition: drtmle

Currently, this package evaluates biomarkers by computing targeted minimum loss-based estimates of the average treatment effect on a biomarker-by-biomarker basis. Actual computation of the TMLE of the ATE is performed by calls to the tmle R package. Recently, the drtmle package (GitHub, CRAN), which computes doubly robust estimates of common parameters, including the ATE, was released via CRAN. A possible area of future work might include updating the backend estimation procedure to use drtmle rather than tmle.

unit testing

now that the JOSS review has begun, it's likely best to finish writing a suite of unit tests to cover all core functions in the package. Aspects of the following should definitely be tested:

biomarkerTMLE_exposure.R in biomarkertmle.R
biomarkerTMLE_outcome.R in biomarkertmle.R
moderated testing with limma in modtest_ic.R
the S4 class constructor in biotmle.R
functions in utils_rnaseq.R

plotting functions are exclude from the coverage report

adding open source standards

need a CONTRIBUTING.md file to allow users to contribute code in a clear manner
need usage examples for all user-facing functions

Error in eif

I don't think this is a dealbreaker but I did come across the following error:

head(eif(rnaseqTMLEout)$E[, seq_len(6)])
Error in eif(rnaseqTMLEout) : could not find function "eif"

Package hex sticker for BioC

hexSticker is an R package facilitating the creation of hex stickers from within R. It might be nice to create one and subsequently add it to the BioC stickers repo.

Implementing supervised distance matrices

The methodology implemented in this software package requires computing individual values of the efficient influence function (or just the influence function for a parametric model). Since these values are already computed / made available, a likely simple extension would be to use the matrix of EIF values to perform supervised clustering, as originally proposed in K.S. Pollard & M.J. van der Laan, "Supervised Distance Matrices", SAGMB, 2008. Heatmaps/plots based on this approach would likely provide a nice, user-friendly visualization.

	tt <- limma::topTable(
	fit = fit,
	coef = 1,
	number = Inf,
	adjust.method = adjust,
	sort.by = "none",
	...
	) %>%
	tibble::rownames_to_column(var = "ID") %>%
	tibble::as_tibble() %>%
	mutate(
	# remove log-fold change and overwrite average expression with ATE
	logFC = NULL,
	AveExpr = biotmle@ateOut,
	var_bayes = fit$s2.post / n_obs,
	)

	warning(paste(
	"Sequential evaluation is strongly discouraged.",
	"\n Proceed with caution."
	))