Git Product home page Git Product logo

nhejazi / biotmle Goto Github PK

View Code? Open in Web Editor NEW
4.0 7.0 2.0 122.73 MB

:package: :microscope: R/biotmle: Targeted Learning with Moderated Statistics for Biomarker Discovery

Home Page: https://code.nimahejazi.org/biotmle/

License: Other

R 84.96% Makefile 1.95% TeX 13.10%
bioinformatics biostatistics bioconductor statistics machine-learning causal-inference r bioconductor-packages bioconductor-package targeted-learning

biotmle's Introduction

hello, i'm Nima

I'm an academic (bio)statistician working at the interface of causal inference, machine learning, and non- and semi-parametric statistics. I'm passionate about building open-source software tools to improve the accessibility of modern, model-agnostic and assumption-lean methods for statistical inference and causal machine learning, and I'm especially excited by the applications of statistics to the biomedical and public health sciences.

Are you looking for open source software for targeted causal machine learning? Maybe you should check out the tlverse project and browse our free open-source handbook!

Nima's github stats

biotmle's People

Contributors

hpages avatar nhejazi avatar nturaga avatar philboileau avatar vobencha avatar

Stargazers

 avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar

Forkers

guhjy masedki

biotmle's Issues

Warning on 'Sequential evaluation is strongly discouraged. Proceed with caution.'?

Just spotted:

Warning in biomarkertmle(se = illuminaData[1:2, ], varInt = varInt_index,  :
  Sequential evaluation is strongly discouraged. 
 Proceed with caution.

while running rev-dep checks. What the reason and rational for that warning? As it's written now, it sounds like something will go wrong or one cannot trust the results.

warning(paste(
"Sequential evaluation is strongly discouraged.",
"\n Proceed with caution."
))

Implementing supervised distance matrices

The methodology implemented in this software package requires computing individual values of the efficient influence function (or just the influence function for a parametric model). Since these values are already computed / made available, a likely simple extension would be to use the matrix of EIF values to perform supervised clustering, as originally proposed in K.S. Pollard & M.J. van der Laan, "Supervised Distance Matrices", SAGMB, 2008. Heatmaps/plots based on this approach would likely provide a nice, user-friendly visualization.

TODO

  • Finish multilevel TMLE-LIMMA algorithm
  • collect data + try the method
  • include satisfactory data into package
  • completion/submission of statistical manuscript
  • submission to JOSS

CIs based on tail bounds

We should implement several alternative methods for inference, not drawn directly from the normal approximation to the asymptotic distribution of influence functions. At least two classes of options are available

  1. Use of a logistic distribution (or other heavy-tailed distribution) as reference instead of normal/t (post-hoc)
  2. Bernstein, Bennett, or Hoeffding bounds for influence function-based estimators, as per https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2827893/

Parallelization via future

The future package provides facilities for registering backends for parallelization across systems with diverse types of schedulers (e.g., SLURM vs. SNOW, etc.). To make the parallelization machinery within this package easier to use, it is likely worth moving away from the use of foreach and over to future.

This directly conflicts the proposal in #18. Only one of these may be implemented.

Limma Integration for RNA-seq

The call made to SuperLearner by the tmle function in the biomarkertmle function of this package fails when the matrix of potential biomarkers (Y) contains discrete values (counts are the standard in RNA-seq, and other NGS biotechnologies). As the substantive contribution of this methodology is implementing an influence curve transform (the tmleOut part of the object returned by biomarkertmle), it would be beneficial to integrate further with core functions of the limma package (specifically, the voom component) to allow for NGS data to be analyzed.

Specifically,

  • modtest_ic should be re-written to be a better wrapper around Limma's core functions: lmFit and topTable. In particular, it should be able to properly handle "voom"-transformed data.
  • For RNA-seq data, biomarkertmle should take as input (as Y) the voomObject$E matrix returned by limma::voom; this might most easily be done by creating a pre-processing function that would ensure that the input SummarizedExperiment object contains "voom"-transformed data in the assay slot.

Workflow Article

Upon completion of the R package implementing this algorithm, it would be valuable to put together a workflow paper for the Bioconductor channel of F1000, demonstrating exactly what this approach adds to the existing suite of tools available for analyzing microarray and next generation sequencing data.

Error in superheat:

Hello, I'm following your excellent biomarker pipeline and ran across an error with the heatmap/dendrogram command (see below).

heatmap_ic(x = limmaTMLEout,
+            row.dendrogram = TRUE,
+            clustering.method = "hierarchical",
+            design = design, FDRcutoff = 1.0, top = 10)
**Error in superheat::superheat(as.matrix(biomarkerTMLEout_top), row.dendrogram = TRUE,  : 
  formal argument "row.dendrogram" matched by multiple actual arguments**

I installed the superheat package so I'm not sure where the error is coming from?

BioC 3.8 - BiocCheck

A current run of BiocCheck::BiocCheck(".") returns a few interesting NOTEs, given below

  • "Avoid sapply(); use vapply() found in files:"
  • "Use accessors; don't access S4 class slots via '@' in examples/vignettes."
Summary:
ERROR count: 0
WARNING count: 3
NOTE count: 6
For detailed information about these checks, see the BiocCheck
vignette, available at
https://bioconductor.org/packages/3.7/bioc/vignettes/BiocCheck/inst/doc/BiocCheck.html#interpreting-bioccheck-output
$error
character(0)

$warning
[1] "y of x.y.z version should be even in release"                                               
[2] "Update R version dependency from 3.4 to 3.5."                                               
[3] "The following files are over 5MB in size: '.git/objects/pack/pack-fa09b8fab3b8aef3e71f2ecc0640cfe4994fb0cd.pack'"

$note
[1] "Avoid sapply(); use vapply() found in files:"                                               
[2] "Use accessors; don't access S4 class slots via '@' in examples/vignettes."                  
[3] "Consider adding runnable examples to the following man pages which document exported objects:"
[4] "Consider shorter lines; 1 lines (0%) are > 80 characters long."                             
[5] "Consider multiples of 4 spaces for line indents, 264 lines(18%) are not."                   
[6] "Cannot determine whether maintainer is subscribed to the bioc-devel\nmailing list (requires admin credentials).  Subscribe here:\nhttps://stat.ethz.ch/mailman/listinfo/bioc-devel"

Simultaneous CIs

We should implement several approaches for generating simultaneous confidence intervals covering the ATE across the full set of specified probes. A few ideas:

`SummarizedExperiment` row names don't propagate to `modtest_ic()` output

tt <- limma::topTable(
fit = fit,
coef = 1,
number = Inf,
adjust.method = adjust,
sort.by = "none",
...
) %>%
tibble::rownames_to_column(var = "ID") %>%
tibble::as_tibble() %>%
mutate(
# remove log-fold change and overwrite average expression with ATE
logFC = NULL,
AveExpr = biotmle@ateOut,
var_bayes = fit$s2.post / n_obs,
)

The row names from the original SummarizedExperiment object fed to to biomarkertmle() don't appear as row names in the topTable slot of the modtest_ic() output. I think that this can be fixed by removing line 81 from above, and adding ID = rownames(biotmle) to the mutate() call.

Parallelization via BiocParallel

The original (and current) implementation of this technique used the "foreach" construct to perform parallel computation to assess the large numbers of biomarkers that are generally dealt with. While support for "foreach"-style computation should be kept intact, it would likely be worth adding a flag to the main functions, to allow the (optional) use of BiocParallel.

adding open source standards

  • need a CONTRIBUTING.md file to allow users to contribute code in a clear manner

  • need usage examples for all user-facing functions

Outcomes method for biomarkers

It appears that the idea of considering the impact of biomarkers on some kind of outcome measures was rather ill-conceived -- that is, with the current use of tmle::tmle for assessing the impact of biomarkers (on some kind of outcome via the ATE) requires that the expression measures associated with the biomarkers be discretized.

Generally speaking, there does not exist a reasonable way to discretize expression measures -- there is great disagreement in the genomics / high-dimensional biology literature -- and any estimates based on discretized biomarker expression values are likely to be un-meaningful scientifically.

Tibble problems in heatmap routine

Since a5482f5, the object in slot tmleOut was changed so as to be wrapped in as_tibble, which coerces it to multiple classes: tbl_df, tbl, data.frame. Upon invoking heatmap_ic, this results in the following warning:

Warning message:
In if (class(x@tmleOut) == "EList") { :
  the condition has length > 1 and only the first element will be used

Article Submission - JOSS

After implementing the combined Limma-TMLE variance shrinkage method, it would likely be worthwhile to submit a short article about the resultant R package to the Journal for Open Source Software (JOSS).

interpretation of ATE "fold change"

"Fold change" is a misleading term for the output generated by the biomarker TMLE procedure. While consistent with terminology employed by the limma package (and in the bioinformatics literature in general), the results produced by the procedure are more similar to a log-risk difference (really, a log-ATE) rather than a fold change (~log-OR)...

Robust variance-moderated testing of linear model parameters

The methodology implemented in this software package performs variance-moderated hypothesis testing of estimators of causal parameters (currently, only the ATE) in the nonparametric statistical model based on the values of the efficient influence function evaluated at each observation. Within the context of parametric models (e.g., the \beta coefficient of a linear model), it is possible to perform the same decomposition (since \hat{\beta} is an asymptotically linear estimator of \beta) and thus perform the same robust variance-moderated hypothesis test.

Backend package transition: drtmle

Currently, this package evaluates biomarkers by computing targeted minimum loss-based estimates of the average treatment effect on a biomarker-by-biomarker basis. Actual computation of the TMLE of the ATE is performed by calls to the tmle R package. Recently, the drtmle package (GitHub, CRAN), which computes doubly robust estimates of common parameters, including the ATE, was released via CRAN. A possible area of future work might include updating the backend estimation procedure to use drtmle rather than tmle.

preliminary R package

devtools::install_github("nhejazi/biotmle", subdir = "pkg", ref = "develop")

@ahubb40 @wilsoncai1992 A preliminary version of the TMLE-Limma R package is now available and can be installed using devtools with the command listed above.

TODO for Bioconductor

To-do list based on comments from the in-progress Bioconductor review:

  • switch illuminaData to use the SummarizedExperiment package and class
  • change biomarkertmle to accept input as objects of class SummarizedExperiment
  • re-write the biotmle class to simply exist as a subclass of SummarizedExperiment
  • explicitly check the "type" argument when first calling the biomarkertmle function
  • re-write heatmap function to make use of the SuperHeat package (see this issue)

example Illumina 2007 data set

  • look at .RData file in BenzeneIllumina2007 / Illumina Data 2007 / R subdirectory on Dropbox
  • note that the data set includes repeated measures, which should be dealt with appropriately
  • use this data set for testing the structure and current code associated with this package

unit testing

now that the JOSS review has begun, it's likely best to finish writing a suite of unit tests to cover all core functions in the package. Aspects of the following should definitely be tested:

  • biomarkerTMLE_exposure.R in biomarkertmle.R
  • biomarkerTMLE_outcome.R in biomarkertmle.R
  • moderated testing with limma in modtest_ic.R
  • the S4 class constructor in biotmle.R
  • functions in utils_rnaseq.R
  • plotting functions are exclude from the coverage report

appveyor build exits with code 5

Using the latest version of a GitHub package that isn't owned by my account causes a failure -- in particular, it appears that I am restricted to the CRAN version of Karthik's wesanderson color palette.

See the build log on Appveyor -- specifically, v1.0.101 -- for details. Package installation fails with exit code 5. It's unclear what this means or why this is happening, given that the Travis-CI build is successful.

Switch heatmap to use superheat?

Based on comments from the in-progress Bioconductor review, the NMF package seems to have been broken by recent updates in one of its dependencies (namely, grid). Considering that superheat will soon be on CRAN, it would be nice to use this (better) tool rather than the heatmap implementation (that was, admittedly) thrown together using NMF.

revdep check on wesanderson

๐Ÿ‘‹ @nhejazi
Just wanted to check if you're having any build issues with this package. I find that wesanderson fails on a revdep check here. But I suspect it was an issue earlier from using Zissou instead of Zissou1.

I see that this has been replaced with:

pal1 <- wesanderson::wes_palette("Royal2", 100, type = "continuous")

If this is correct, I'll ignore the failed check and proceed. Please advise.

Current error is:

   > heatmap_ic(x = limmaTMLEout, design = design, FDRcutoff = 0.05,
    +            top = 15)
    Error in wes_palette("Zissou", 100, type = "continuous") : 
      Palette not found.
    Calls: heatmap_ic -> wes_palette
    Execution halted

Thanks for your help ๐Ÿ™

What is the point of modtestOut?

The following error is produced by line 56 in modtest_ic:
cannot coerce class ""qr"" to a data.frame

The line in question attempts to change an MArrayLM object (produced by limma::eBayes to a data.frame via biotmle@modtestOut <- as.data.frame(fit).

Importantly, it is unclear what utility is gained by even saving the objected produced by limma::eBayes, as all inferential information is stored in the output of limma::topTable

Wishlist for Bioc3.8 Release

- [ ] address #36 by adding option/argument backend, giving the option to use either the tmle package (current implementation) or drtmle

  • re-think and implement color scheme for plots, moving away from wesanderson to ggsci (fits intended applications better) and use fall back on default color scheme for heatmaps (superheat uses viridis)
  • re-style code to follow tidyverse style conventions using styler
  • fix broken pkgdown site

succinctly convey analysis results

The Bioconductor developer guidelines recommend implementing a method for show (in this case show.biotmle) to better convey the information presented by the output object. Currently, biotmle objects are sub-classed from the standard SummarizedExperiment class, so the standard call to print/show only displays the relevant SummarizedExperiment information, while the analysis results are hidden (though a tutorial on how to extract them is treated in the vignette). This should be corrected ASAP.

Allow specification of functional forms for g and Q fits

The tmle package allows for the functional form of the nuisance parameters g and Q to be specified via the optional arguments gform and Qform, respectively. These arguments simply take in a character that is then translated to a formula for the glm function (via as.formula) --- e.g., gform = "A ~ W1 + W2". Since the machinery used for point estimation in this package relies on the tmle package (though the variance estimator is different here), an option should be added to allow gform and Qform (and, indeed, other arbitrary arguments) to be passed to the call to tmle in biomarkertmle_exposure. This would be most simply done via the addition of a ... argument.

Testing IC estimates

  • use only intercept in design matrix; otherwise, unclear what the t-test is actually doing...
  • using more than just the intercept term after applying the IC transform would result in a test that is not necessarily well-defined

Error in eif

I don't think this is a dealbreaker but I did come across the following error:

head(eif(rnaseqTMLEout)$E[, seq_len(6)])
Error in eif(rnaseqTMLEout) : could not find function "eif"

biomarkertmle fails on data transposition

Hi, when I tried the script on the page https://code.nimahejazi.org/biotmle/articles/rnaseqProcessing.html the code is as follows:

se <- SummarizedExperiment(assays = list(counts = DataFrame(ngs_data)),
                           colData = DataFrame(design))
se
rnaseqTMLEout <- biomarkertmle(se = se,
                               varInt = 1,
                               type = "exposure",
                               ngscounts = TRUE,
                               parallel = TRUE,
                               family = "gaussian",
                               g_lib = c("SL.mean", "SL.glm", "SL.randomForest"),
                               Q_lib = c("SL.mean", "SL.glm", "SL.randomForest",
                                         "SL.nnet")
                              )

But it had a bug report:

Error in t.default(assay(se)) : argument is not a matrix
Calls: biomarkertmle -> as.data.frame -> t -> t.default
Execution halted

And I found the R script on Bioconductor is as follow. The code has been commented out.

se <- SummarizedExperiment(assays = list(counts = DataFrame(ngs_data)),
                           colData = DataFrame(design))
se
#  rnaseqTMLEout <- biomarkertmle(se = se,
#                                 varInt = 1,
#                                 type = "exposure",
#                                 ngscounts = TRUE,
#                                 parallel = TRUE,
#                                 family = "gaussian",
#                                 g_lib = c("SL.mean", "SL.glm", "SL.randomForest"),
#                                 Q_lib = c("SL.mean", "SL.glm", "SL.randomForest",
#                                           "SL.nnet")
#                                )
#  head(rnaseqTMLEout@tmleOut$E[, seq_len(6)])
data(rnaseqtmleOut)
head(rnaseqTMLEout@tmleOut$E[, seq_len(6)])

So, how can I fix the bug?

BioC 3.6 Package Maintenance

After this package was accepted in Bioconductor, the copy made available through their repository was switched over from using git to svn for version control.

As development is taking place on the version hosted on GitHub, it's important to maintain some form of synchronicity between the git and svn histories. Some resources on this:

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.