
nhejazi / txshift

13 stars · 6 watchers · 3 forks · 2.25 MB

:package: :game_die: R/txshift: Efficient Estimation of the Causal Effects of Stochastic Interventions, with Corrections for Outcome-Dependent Sampling

Home Page: https://codex.nimahejazi.org/txshift

License: Other

Languages: R 91.49%, TeX 8.17%, Makefile 0.34%

Topics: causal-inference, targeted-learning, machine-learning, stochastic-interventions, treatment-effects, variable-importance, censored-data, causal-effects, statistics, robust-statistics

txshift's People

Contributors

benkeser, nhejazi


txshift's Issues

scaling outcome before estimation of Q

Hi Nima, It appears that txshift is scaling the outcome prior to estimating Q. Is it possible to add an option so that Q can be estimated on its original scale? Thank you!
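For context, here is a minimal sketch of the kind of scaling at issue (illustrative only; not necessarily txshift's exact internals): outcomes are commonly mapped to [0, 1] before fitting Q so that a logistic fluctuation can be used, with predictions mapped back to the original scale afterwards.

# minimal sketch: scale Y to [0, 1] before fitting Q, then back-transform
set.seed(1)
y <- rnorm(100, mean = 3, sd = 2)
y_min <- min(y)
y_max <- max(y)
y_star <- (y - y_min) / (y_max - y_min)      # scaled outcome used to fit Q
qhat_star <- runif(100)                      # stand-in for Q predictions
qhat <- qhat_star * (y_max - y_min) + y_min  # predictions on the original scale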

checking conditional density fits

Setting 1:
W_1 ~ Binom(1/2)
\Delta | W_1 = w_1 ~ Binom(plogis(w_1))
A | W_1 = w_1 ~ Normal(2*w_1, 1)
\Delta A = \Delta * A
O = (W_1, \Delta A)

Simulate larger and larger data sets (and more and more bins) and fit condensier two ways:

  • no weights -- given \Delta = 1, fit condensier with A ~ W_1
    • sane estimates => the W_1 = 0 fit should look like Normal(0, 1) and the W_1 = 1 fit like Normal(2, 1)
  • weights -- given \Delta = 1, fit condensier with A ~ W_1 and weights = 1/plogis(w_1) (i.e., the true weights)
    • sane estimates => the W_1 = 0 fit should look like Normal(0, 1) and the W_1 = 1 fit like Normal(2, 1)
Setting 2:
W_1 ~ Binom(1/2)
W_2 ~ Binom(1/2)
\Delta | W_1 = w_1, W_2 = w_2 ~ Binom(plogis(w_1 + w_2))
A | W_1 = w_1, W_2 = w_2 ~ Normal(2*w_1*w_2, 1)
\Delta A = \Delta * A
\Delta W_2 = \Delta * W_2
O = (W_1, \Delta W_2, \Delta A)

Fit condensier two ways:

  • no weights -- given \Delta = 1, fit condensier with A ~ W_1 + W_2
    • sane estimates => the W_1 = 1, W_2 = 1 fit should look like Normal(2, 1); else Normal(0, 1)
  • weights -- given \Delta = 1, fit condensier with A ~ W_1 + W_2 and weights = 1/plogis(w_1 + w_2)
    • sane estimates => the W_1 = 1, W_2 = 1 fit should look like Normal(2, 1); else Normal(0, 1)
The reason for checking both ways is the following. We initially thought that we would need weights for our problem in order to obtain a valid density estimate. The problem is as above, where there is some biased (e.g., case-control) sampling. However, it now seems to me that if W is the whole set of confounders of A and Delta, then you could actually just estimate the density in Delta = 1 folks and still be ok. This is because the observed data conditional density given Delta = 1 and W is the same as the full data conditional density given W (because \Delta \perp A | W). Does this seem right to you? Or am I crazy?

In any case, Nima is going to check the fits in both cases and we'll see. Right now I'm guessing there will be finite-sample differences, but asymptotically you'll end up with the same quantity.
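As a concrete starting point, here is a minimal sketch of Setting 1 and the unweighted sanity check in base R (the condensier calls themselves are elided). Since \Delta \perp A | W_1 by construction, the stratum-specific moments among \Delta = 1 should already match the targets:

# simulate Setting 1 and check conditional moments among Delta = 1
set.seed(34729)
n <- 1e5
w1 <- rbinom(n, 1, 0.5)
delta <- rbinom(n, 1, plogis(w1))
a <- rnorm(n, mean = 2 * w1, sd = 1)
# among the observed (Delta = 1), A | W_1 = 0 should be ~ Normal(0, 1)
# and A | W_1 = 1 should be ~ Normal(2, 1)
tapply(a[delta == 1], w1[delta == 1], mean)  # expect approx. 0 and 2
tapply(a[delta == 1], w1[delta == 1], sd)    # expect approx. 1 and 1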

Checking estimation consistency

  • For density estimation with condensier:

  • For outcome estimation with Super Learner (via sl3):
    • Predict on a new data set for different values of A (e.g., from -4 to 4) with W = 0.
    • Make a scatter plot of Y among observations with W = 0. Add the prediction line. Add the true line. (A sketch of this check appears below.)
    • Repeat the above for W = 1.
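A minimal sketch of the W = 0 check, under an assumed data-generating process, with a plain GLM learner standing in for the full Super Learner (swap in an Lrnr_sl stack for the real check); the sl3 task/train/predict pattern is assumed:

library(data.table)
library(sl3)
set.seed(11249)

# assumed data-generating process, for illustration only
n <- 1000
W <- rbinom(n, 1, 0.5)
A <- rnorm(n, mean = 2 * W, sd = 1)
Y <- A + W + rnorm(n)
dt <- data.table(W = W, A = A, Y = Y)

# fit the outcome regression (GLM stand-in for the Super Learner)
task <- sl3_Task$new(dt, covariates = c("W", "A"), outcome = "Y")
fit <- Lrnr_glm$new()$train(task)

# predict on a grid of A values with W fixed at 0 (Y = 0 is a placeholder)
grid <- data.table(W = 0, A = seq(-4, 4, by = 0.1), Y = 0)
preds <- fit$predict(sl3_Task$new(grid, covariates = c("W", "A"), outcome = "Y"))

# scatter plot of Y in the W = 0 stratum, with prediction and true lines
plot(dt[W == 0, A], dt[W == 0, Y], xlab = "A", ylab = "Y", main = "Q check, W = 0")
lines(grid$A, preds, col = "blue")  # prediction line
abline(a = 0, b = 1, col = "red")   # truth here: E[Y | A, W = 0] = A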

error handling matrix W

For some reason, the following simple example (with a matrix W, instead of a vector) does not work:

library(data.table)
library(tidyverse)
library(condensier)
library(txshift)
set.seed(429153)

# simulate simple data for tmle-shift sketch
n_obs <- 1000  # number of observations
n_w <- 3  # number of baseline covariates
tx_mult <- 2  # multiplier for the effect of W = 1 on the treatment

## baseline covariates -- simple, binary
W <- replicate(n_w, rbinom(n_obs, 1, 0.5))

## create treatment based on baseline W
A <- as.numeric(rnorm(n_obs, mean = tx_mult * (rowSums(W) > 2), sd = 1))

# create outcome as a linear function of A, W + white noise
Y <- A + (1 / exp(rowSums(W))) + rnorm(n_obs, mean = 0, sd = 1)

# create data for examination
obs <- as.data.table(list(W, A, Y))
setnames(obs, c(paste0("W", seq_len(ncol(W))), "A", "Y"))

tmle_glm_shift_1 <- tmle_txshift(W = W, A = A, Y = Y, delta = 0.5,
                                 fluc_method = "standard",
                                 ipcw_fit_args = NULL,
                                 g_fit_args = list(fit_type = "glm", nbins = 20,
                                                   bin_method = "dhist",
                                                   bin_estimator = speedglmR6$new(),
                                                   parfit = FALSE),
                                 Q_fit_args = list(fit_type = "glm",
                                                   glm_formula = "Y ~ .")
                                )
tmle_glm_shift_1

...it returns an error:

Error in array(x, c(length(x), 1L), if (!is.null(names(x))) list(names(x), : 'data' must be of a vector type, was 'NULL'

delta specification issue

I'm using the version of txshift currently on CRAN. My understanding of what this package is doing may be flawed, but I think the specification of delta is broken somewhere. If specifying delta = 0 for a continuous treatment A, I intuit this means we are estimating E_n[E(Y | A = a + 0, W = w)] - E_n[E(Y | A = a, W = w)] = 0. This does not seem to be what the package is estimating -- see the example below (based on the example in the package documentation).

n_obs <- 1000  # number of observations
n_w <- 1  # number of baseline covariates
tx_mult <- 2  # multiplier for the effect of W = 1 on the treatment
W <- as.numeric(replicate(n_w, rbinom(n_obs, 1, 0.5)))
A <- as.numeric(rnorm(n_obs, mean = tx_mult * W, sd = 1))
Y <- A + W + rnorm(n_obs, mean = 0, sd = 1)


tmle_hal_shift_1 <- txshift(
  W = W, A = A, Y = Y, delta = 0.0,
  fluctuation = "standard",
  g_exp_fit_args = list(fit_type = "hal", n_bins = 5,
                        grid_type = "equal_mass",
                        lambda_seq = exp(seq(-1, -12, length = 500))),
  Q_fit_args = list(fit_type = "glm", glm_formula = "Y ~ .")
)
tmle_hal_shift_1

Counterfactual Mean of Shifted Treatment
Intervention: Treatment + 0

txshift Estimator: tmle
Estimate: 1.537
Std. Error: 0.0663
95% CI: [1.407, 1.6671]

Implementing AIPW

We should implement the augmented IPW estimator -- in the IPCW case, this requires initial estimates of Pi, g, and Q, and the IPCW-EIF constructed from the initial (unfluctuated) components. The full-data case is even simpler. Recall that the mean of the EIF needs to be added to the substitution estimator computed using the initial estimates.
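A minimal sketch of that construction, with hypothetical names (psi_plugin for the substitution estimator from the un-fluctuated fits, eif_vals for the estimated EIF evaluated at each observation):

# one-step (AIPW) estimator: substitution estimate plus the EIF mean
psi_plugin <- 1.45                  # hypothetical substitution estimate
eif_vals <- rnorm(1000, sd = 0.8)   # stand-in for estimated EIF values
psi_aipw <- psi_plugin + mean(eif_vals)
# EIF-based standard error and Wald-style 95% CI
se_aipw <- sqrt(var(eif_vals) / length(eif_vals))
ci_aipw <- psi_aipw + c(-1, 1) * qnorm(0.975) * se_aipw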

Problems with CI coverage

There's something horribly wrong with the parameter estimate -- the Monte Carlo variance is quite high in small samples and still bad even in large samples, affecting confidence interval coverage quite significantly:

      n       est        var        bias   coverage
    100  1.841118  0.6168012  -0.1588817      0.282
  10000  1.998961  0.0005663  -0.0010387      0.917

The bug reported here was introduced at some point between ee46907 and 81175b1.

tracking results over iterations

Things to have when invoking the IPCW-TML estimation process:

  • It may be good to keep track of how psi changes across iterations (including an initial estimate of psi).
  • We might as well also keep track of the variance estimate, so we can examine, e.g., how well the first-step TMLE performs vs. the full-blown iterative version.
  • It would be good to keep track of the number of iterations. (A bookkeeping sketch appears below.)

adapted from #7 by @benkeser.
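A minimal bookkeeping sketch, with made-up updates standing in for the actual targeting step:

set.seed(1)
psi_n <- 1.8   # initial (un-targeted) estimate of psi
var_n <- 0.5   # initial variance estimate
trace <- data.frame(iter = 0, psi = psi_n, var = var_n)
for (iter in seq_len(10)) {
  psi_n <- psi_n + rnorm(1, sd = 0.01)  # stand-in for the TMLE update
  var_n <- 0.9 * var_n                  # stand-in for the updated variance
  trace <- rbind(trace, data.frame(iter = iter, psi = psi_n, var = var_n))
  if (abs(diff(tail(trace$psi, 2))) < 1e-3) break  # convergence check
}
trace  # psi and its variance at every step; nrow(trace) - 1 iterations run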

IPCW GLM parallelization fails due to glm.fit with censoring

There appears to be some weird interaction between an invocation of glm.fit and the future ecosystem of packages -- that is, the error only appears reproducible when doFuture is used to parallelize runs of the wrapper function tmle_shifttx, and only when the censoring variable is provided as an input. This appears to be a rather convoluted error that may require substantial investigation, as the exact same block of code runs without error when the type of TMLE is changed to use Super Learner via sl3.
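For reference, the general shape of the invocation that triggers the failure (a sketch; the simulation details and estimator arguments are elided):

library(foreach)
library(future)
library(doFuture)
registerDoFuture()   # use futures as the foreach parallel backend
plan(multisession)   # run iterations in background R sessions
results <- foreach(seed = 1:10) %dopar% {
  set.seed(seed)
  # ... simulate data, then invoke tmle_shifttx() with the censoring
  # variable supplied and the glm flavor selected ...
  NULL  # placeholder return
}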

In particular, this means that invocations of the IPCW-TMLE with the glm flavor ought to be avoided (or, at least, not trusted) for the time being. There is no evidence (yet) that there is anything flawed with the IPCW-TMLE that relies on Super Learning.

Since the routines in question are not that different, the error implies that the problematic code occurs in this block.

And here is the relevant error:
[screenshot of the error message, captured 2017-12-23; not reproduced here]

Bounding natural conditional densities

Currently, in generating the auxiliary covariate for the efficient influence function, a bounding-type procedure is implemented for the post-intervention (counterfactual) conditional density ratio, where non-finite values are set to 1. This should be extended to the case of natural/observed conditional densities, since poor estimates of such could lead to numerical instability in downstream steps of the procedure. See https://github.com/nhejazi/txshift/blob/master/R/fit_mechanisms.R#L443-L448. Suggested by @jeremyrcoyle.
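A minimal sketch of the extension being suggested (variable names hypothetical): apply the same non-finite-to-1 bounding already used for the counterfactual density ratio to the natural conditional density estimates as well.

# hypothetical vectors of estimated densities / density ratios
dens_ratio <- c(1.10, Inf, 0.90, NaN, 1.30)  # counterfactual density ratio
g_natural  <- c(0.40, 1e-12, Inf, NaN, 0.70) # natural conditional density
dens_ratio[!is.finite(dens_ratio)] <- 1  # existing bounding behavior
g_natural[!is.finite(g_natural)] <- 1    # proposed extension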

Add support for censoring

We need to add support for a censoring node C that follows the exposure in time-ordering. Thus, in the case of the IPCW-augmented estimators, the data structure will be of the form O = (W, A, C, Y, Delta), where Delta = f(W, Y) is a two-phase sampling indicator. The nuisance regression for the mechanism g_C, the probability of censoring conditional on covariates, should be handled similarly to the nuisance regressions for the exposure and outcome.
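A minimal sketch of the proposed data structure (all mechanisms made up for illustration), with C following A in time-ordering and Delta a two-phase sampling indicator depending on (W, Y):

set.seed(721)
n <- 1000
W <- rbinom(n, 1, 0.5)
A <- rnorm(n, mean = 2 * W, sd = 1)
C <- rbinom(n, 1, plogis(1 + W))                # C = 1: remains uncensored
Y <- ifelse(C == 1, A + W + rnorm(n), NA)       # Y observed only if C = 1
Delta <- rbinom(n, 1, plogis(W + ifelse(C == 1, Y, 0)))  # Delta = f(W, Y)
O <- data.frame(W, A, C, Y, Delta)              # O = (W, A, C, Y, Delta)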

User-specified g, Q, and Pi fits

We should have a way of including user-specified estimates for Q, g, and Pi. This should be trivial to implement.

@benkeser concurs:
"In general, I think the most important next step is to add the functionality to input our own estimates of Q, g, and Pi. I think we could be seeing excess bias due to the fact that g is not being estimated well (too few or too many bins) AND the regression of the EIF ~ V | Delta = 1 is not being estimated well (surely the main terms GLM is misspecified). Because we have the nice multiple robustness property, we are seeing that the estimates are still consistent...in spite of the fact that we’re doing a shitty job estimating EIF ~ V | Delta = 1 (and g too, since if we fix the number of bins there will be bias asymptotically). However, we’re saved by the fact that we get Q and we get Pi. In fact, we’re truly saved by the fact that we get those guys with a GLM, so that we’re still asymptotically linear. It’s (very) cool that we can already see that this is the case. Nevertheless, the finite-sample coverage is still garbage, so we see that for inference in small samples, we need to be doing a better job getting all the relevant nuisance parameters."

normalizing density estimates instead of targeting qn

In order to ensure efficiency of the IPCW-TMLE, there are two avenues by which we may proceed: (1) add an extra targeting step, wherein we obtain better estimates of q_n (which corresponds to only the part of the EIF having to do with the baseline covariates W) and a correspondingly updated value of the fluctuation nuisance parameter \epsilon; or (2) it is apparently just as good (according to vdL) to normalize the finalized IPCW density estimates so that the estimated values form a proper density. Obviously, the latter is easier.
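A minimal sketch of option (2), with hypothetical objects: dens_mat is an n-by-nbins matrix of IPCW-weighted bin-density estimates and bin_width gives each bin's width; dividing each row by its total mass makes it a proper density.

set.seed(5)
dens_mat <- matrix(runif(5 * 4), nrow = 5, ncol = 4)  # stand-in estimates
bin_width <- rep(0.5, 4)
mass <- as.vector(dens_mat %*% bin_width)  # total mass per observation
dens_norm <- dens_mat / mass               # rows now integrate to one
rowSums(sweep(dens_norm, 2, bin_width, `*`))  # check: all equal to 1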

piecewise monotonicity of dose-response projections

The summarization procedure of projecting individual risk estimates onto a working linear MSM has been very useful, but it could be improved somewhat by incorporating an option for monotonicity. As is, we can fit either linear or piecewise linear working MSM summaries (with no inference for the latter unless the knot point is pre-specified), but the individual risk estimates need not follow a monotonic pattern. In certain use cases (e.g., VE curves), it may be useful to enforce monotonicity when it can be justified by background knowledge (e.g., observed biological responses).
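A minimal sketch of one way to add this, using base R's isotonic regression (isoreg) to project made-up risk estimates onto a monotone fit; inference after such a projection would still need care.

delta_grid <- seq(0, 2, by = 0.25)  # grid of shift values
risk_est <- c(0.10, 0.12, 0.11, 0.15, 0.14, 0.18, 0.17, 0.21, 0.22)
iso_fit <- isoreg(delta_grid, risk_est)
risk_mono <- iso_fit$yf  # monotone non-decreasing projection of the estimates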

Implementing Collaborative TMLE

It would be nice to eventually offer collaborative targeted maximum likelihood estimation (C-TMLE) as an option for the shift-intervention causal effect parameter implemented here.

Names for variables in IPCW regression

It seems like the function to estimate \Pi_0 requires the glm formula to be specified using the auto-generated names V1, V2, etc. It would be more helpful if the user could specify it in terms of colnames(W) and Y.
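To illustrate (argument names here are hypothetical, mirroring the other *_fit_args lists): currently something like the first specification is required, while the second would be friendlier.

# current (hypothetical sketch): auto-generated names V1, V2, ...
ipcw_fit_args <- list(fit_type = "glm", glm_formula = "Delta ~ V1 + V2")
# desired: the user writes the formula in terms of colnames(W) and Y
ipcw_fit_args <- list(fit_type = "glm", glm_formula = "Delta ~ W1 + W2 + Y")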

JOSS paper

We should write a short paper for JOSS on this R package, since the software tool itself took quite a bit of work beyond what is presented in the soon-to-be-submitted manuscript.

Density estimation with HAL

For the time being, it would be quite useful to have a custom implementation of conditional density estimation, limited only to the use of HAL (via hal9001) as a learner for any given bin/grid. The method should use the pooled hazards approach to density estimation rather than sequential regressions along bins.
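A minimal sketch of the pooled-hazards data re-structuring (the HAL fit itself, e.g., via hal9001, would then be a single hazard regression on this long-format data rather than separate per-bin regressions):

set.seed(581)
a <- rnorm(100)
n_bins <- 5
breaks <- quantile(a, probs = seq(0, 1, length.out = n_bins + 1))
bin_id <- cut(a, breaks = breaks, include.lowest = TRUE, labels = FALSE)
# one row per (observation, bin) up to the observation's own bin; the
# pooled hazard regression models P(bin = j | bin >= j, W) across all j
long <- do.call(rbind, lapply(seq_along(a), function(i) {
  data.frame(id = i, bin = seq_len(bin_id[i]),
             in_bin = as.numeric(seq_len(bin_id[i]) == bin_id[i]))
}))
head(long)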

Improve flexibility of density estimation argument

The argument g_fit_args currently accepts two kinds of input: for fit_type = "glm", the method allows further arguments to be passed through to condensier to construct an estimator using GLMs within bins, while, for fit_type = "sl", only a pre-built object of class Lrnr_sl need be provided. To increase flexibility, this argument should be loosened to allow the haldensify package to be used for density estimation, optionally instead of condensier.

IPC Weights for IPCW-TMLEs

In order to compute IPCW-TMLEs for two-stage sampling, it is necessary to weight the (unobserved) full-data likelihood with an IPCW-type weight. For details, consult the relevant manuscript by Rose and van der Laan (2011).

To implement IPCW-TMLEs for the parameter considered in the initial offering of the shifttx package, a weight term needs to be incorporated throughout the various functions in the package (see the sketch after this list):

  1. tmle_shifttx - the primary user-facing wrapper function...

    • ...
    • ...
  2. est_g - to estimate the propensity score using the conditional density fitting options offered by the condensier R package.

    • ...
  3. est_Q - to estimate the outcome regression, which internally uses either glm or sl3.

    • ...
    • ...
  4. fit_fluc - to fit the regression for submodel fluctuation. Internally, this uses only glm.

    • ...
    • ...

To implement the naïve (and inefficient) version of the IPCW-TMLE, only the above functions need be altered. To implement the fully efficient version of the estimator, a new function corresponding to the more complex derivation of the efficient influence function (incorporating terms for the IPC weights) must be implemented.
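A minimal sketch of the weight construction itself (names hypothetical): with two-phase sampling indicator Delta and an estimate Pi_n of the sampling mechanism, sampled observations receive weight 1 / Pi_n and unsampled observations receive weight zero.

set.seed(916)
n <- 500
V <- rbinom(n, 1, 0.5)            # variables driving second-phase sampling
Delta <- rbinom(n, 1, plogis(V))  # two-phase sampling indicator
pi_n <- glm(Delta ~ V, family = binomial())$fitted.values  # estimate of Pi_0
ipc_weights <- Delta / pi_n       # zero for unsampled observations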
