
metalearners's Introduction

metalearners


MetaLearners for Conditional Average Treatment Effect (CATE) estimation

The library focuses on providing

  • Methodologically sound cross-fitting
  • Convenient access to and reuse of base models
  • Consistent APIs across MetaLearners
  • Support for more than two treatment variants
  • Integrations with pandas, shap, lime, optuna and soon onnx

Example

from metalearners import RLearner
from lightgbm import LGBMClassifier, LGBMRegressor

df = ...  # a data frame with covariates, observed outcomes and treatment assignments

rlearner = RLearner(
    nuisance_model_factory=LGBMRegressor,
    propensity_model_factory=LGBMClassifier,
    treatment_model_factory=LGBMRegressor,
    is_classification=False,
    n_variants=2,
)

features = ["age", "weight", "height"]
rlearner.fit(df[features], df["outcomes"], df["treatment"])  # covariates, outcomes, treatment assignments
cate_estimates = rlearner.predict(df[features], is_oos=False)

Please refer to our docs for many more in-depth and reproducible examples.

Installation

metalearners can be installed either via PyPI with

$ pip install metalearners

or via conda-forge with

$ conda install metalearners -c conda-forge

Development

Development instructions can be found here.

metalearners's People

Contributors

francescmartiescofetqc, kklein, quant-ranger[bot], dependabot[bot], apoorvalal, github-actions[bot]


metalearners's Issues

Provide nuisance estimates to pseudo-outcome methods

Status quo

As of now we have the following interface for the pseudo-outcome methods in the DR-Learner and R-Learner:

  • DR-Learner

    def _pseudo_outcome(
        self,
        X: Matrix,
        y: Vector,
        w: Vector,
        treatment_variant: int,
        is_oos: bool,
        oos_method: OosMethod = OVERALL,
        epsilon: float = _EPSILON,
    ) -> np.ndarray:

  • R-Learner

    def _pseudo_outcome_and_weights(
        self,
        X: Matrix,
        y: Vector,
        w: Vector,
        treatment_variant: int,
        is_oos: bool,
        oos_method: OosMethod = OVERALL,
        mask: Vector | None = None,
        epsilon: float = _EPSILON,
    ) -> tuple[np.ndarray, np.ndarray]:

Since both kinds of pseudo-outcome require nuisance model estimates, and since these are not provided as input arguments, they are estimated inside the respective pseudo-outcome method.

Importantly, the pseudo outcome methods are treatment-variant specific. Yet, the nuisance estimates estimated as part of the pseudo outcome methods are not treatment variant specific:

  • In the case of the R-Learner, the overall outcome model $\hat{\mu}$ is applied to all data and the overall propensity model $\hat{e}$ is applied to all data. Only after the estimation is the data filtered with respect to the treatment variant at hand:

    y_estimates = self.predict_nuisance(
        X=X,
        is_oos=is_oos,
        model_kind=OUTCOME_MODEL,
        model_ord=0,
        oos_method=oos_method,
    )[mask]
    w_estimates = self.predict_nuisance(
        X=X,
        is_oos=is_oos,
        model_kind=PROPENSITY_MODEL,
        model_ord=0,
        oos_method=oos_method,
    )[mask]

  • In the case of the DR-Learner, the propensity $\hat{e}$ and all conditional average outcomes $\hat{\mu}_k$ are estimated for all data points; filtering of variant-specific information only happens thereafter:

    conditional_average_outcome_estimates = (
        self.predict_conditional_average_outcomes(
            X=X,
            is_oos=is_oos,
            oos_method=oos_method,
        )
    )
    propensity_estimates = self.predict_nuisance(
        X=X,
        is_oos=is_oos,
        oos_method=oos_method,
        model_kind=PROPENSITY_MODEL,
        model_ord=0,
    )
    y0_estimate = conditional_average_outcome_estimates[:, 0]
    y1_estimate = conditional_average_outcome_estimates[:, treatment_variant]

Assessment

In the case of $k>2$ treatment variants, the approach above causes needless computational effort: the same nuisance estimates are recomputed for every single treatment variant that is not considered the 'control'.

Computational burden aside, it is not clear that having the pseudo-outcome methods perform the estimation themselves makes for the better interface. Wouldn't it feel more natural (and wouldn't concerns be better separated) if the pseudo-outcome methods merely defined the pseudo-outcome given the nuisance estimates, rather than estimating those quantities themselves?
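As an illustration, consider a hypothetical refactored signature in which the caller estimates the nuisances once for all variants and passes them in, so that the pseudo-outcome method becomes a pure function of its inputs. The parameter names below are illustrative and do not reflect the actual private API:

def _pseudo_outcome_and_weights(
    self,
    y: Vector,
    w: Vector,
    treatment_variant: int,
    outcome_estimates: np.ndarray,     # mu_hat(X), computed once by the caller
    propensity_estimates: np.ndarray,  # e_hat(X), computed once by the caller
    mask: Vector | None = None,
    epsilon: float = _EPSILON,
) -> tuple[np.ndarray, np.ndarray]:
    # Only combines the provided estimates into pseudo-outcomes and weights;
    # no nuisance predictions happen in here.
    ...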

Leakage in X-Learner in-sample prediction

Issue at hand

@ArseniyZvyagintsevQC brought the following to our attention:

Let us assume a binary treatment variant scenario in which we want to work with in-sample predictions, i.e. is_oos=False.

The current implementation would go about fitting five models, three of which considered nuisance models and two of which considered treatment models:

| model | target | cross-fitting dataset | stage | name |
|---|---|---|---|---|
| $\hat{\mu}_0$ | $Y_i$ | $\{(X_i, Y_i) \mid W_i=0\}$ | nuisance | "treatment_variant" |
| $\hat{\mu}_1$ | $Y_i$ | $\{(X_i, Y_i) \mid W_i=1\}$ | nuisance | "treatment_variant" |
| $\hat{e}$ | $W_i$ | $\{(X_i, Y_i)\}$ | nuisance/propensity | "propensity_model" |
| $\hat{\tau}_0$ | $\hat{\mu}_1(X_i) - Y_i$ | $\{(X_i, Y_i) \mid W_i=0\}$ | treatment | "control_effect_model" |
| $\hat{\tau}_1$ | $Y_i - \hat{\mu}_0(X_i)$ | $\{(X_i, Y_i) \mid W_i=1\}$ | treatment | "treatment_effect_model" |

More background on this here.

Note that each of these models is cross-fitted. More precisely, each is cross-fitted wrt the data it has seen at training time.

Let's suppose now that we are at inference time and encounter an in-sample data point $i$. Without loss of generality, let's assume that $W_i=1$.
In order to come up with a CATE estimate, the predict method will run

  • $\hat{\tau}_0(X_i)$ with is_oos=True since this datapoint has not been seen during training time of the model $\hat{\tau}_0$
  • $\hat{\tau}_1(X_i)$ with is_oos=False since this datapoint has indeed been seen during the training time of the model $\hat{\tau}_1$

The latter call makes sure we avoid leakage in $\hat{\tau}_1$. The former call, however, does not completely avoid leakage:
even though $i$ hasn't been seen in the training of $\hat{\tau}_0$, it has been seen in the training of $\hat{\mu}_1$, which is, in turn, used by $\hat{\tau}_0$. Therefore, the observed outcome $Y_i$ can leak into the estimate $\hat{\tau}_0(X_i)$.

Next steps

We can devise an extreme, naïve approach to counteract this issue by training every type of model once per data point, which amounts to leave-one-out cross-fitting (see the sketch after this list). Clearly, this ensures the absence of data leakage. The challenge with this issue revolves around coming up with a design that

  • allows for arbitrary numbers (>1, <=n) of cross-fitting folds, i.e. not fixing it to be equal to the number of training data points
  • integrates well into the structure of the library
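For illustration only, the extreme variant mentioned above corresponds to leave-one-out cross-fitting, which scikit-learn can express directly; this sketch is not how the library implements cross-fitting:

import numpy as np
from lightgbm import LGBMRegressor
from sklearn.model_selection import LeaveOneOut, cross_val_predict

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=50)

# Every in-sample prediction for row i stems from a model that never saw row i,
# at the cost of fitting as many models as there are data points.
in_sample_predictions = cross_val_predict(LGBMRegressor(), X, y, cv=LeaveOneOut())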

Model-specific initialization fails if a superset of expected base model keys is provided

Example

from metalearners import TLearner
from lightgbm import LGBMRegressor

tlearner = TLearner(
    nuisance_model_factory=LGBMRegressor,
    is_classification=False,
    n_variants=2,
    nuisance_model_params={"verbose": -1},
    feature_set={"variant_outcome_model": None, "useless_model": []}
)

tlearner.feature_set

yields

{'variant_outcome_model': {'variant_outcome_model': None, 'useless_model': []}}

We observe that the model dictionary has been constructed improperly. This was spotted by @MatthiasLuxQC .

Underlying problem

metalearners.metalearner._initialize_model_dict only returns the output if the set of provided keys is exactly equal to the set of expected keys:

https://github.com/Quantco/metalearners/blob/main/metalearners/metalearner.py#L95-L98

Instead, we should probably test that the provided keys are a superset of the expected keys.
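A minimal sketch of the suggested relaxation, assuming the helper receives the user-provided argument and the expected model keys (the real implementation may differ, and whether surplus keys should be dropped silently or raise is a design decision):

def _initialize_model_dict(argument, expected_names):
    if isinstance(argument, dict) and set(expected_names) <= set(argument.keys()):
        # Accept supersets of the expected keys and keep only the expected ones,
        # instead of requiring exact equality.
        return {name: argument[name] for name in expected_names}
    # Otherwise broadcast the single argument to all expected model keys.
    return {name: argument for name in expected_names}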

Allow for passing of 'fixed' propensity scores

Several MetaLearners, such as the R-Learner or DR-Learner, have propensity base models.

As of now, they are trained -- just as all other base models -- based on the data passed through the MetaLearner's fit call.

In particular in the case of non-observational data, it might be interesting to pass 'fixed' propensity scores instead of trying to infer the propensities from the experiment data.

Next steps:

  • Clearly define in which scenarios it might be desirable to have 'fixed' propensity score estimates
  • Assess different implementation options and their design implications (e.g. does creating a wrapped 'model' on the end-user side that merely predicts the fixed scores do the trick? See the sketch below. Is it a reasonable suggestion to provide no features to the propensity model? If not, should the scores be provided in __init__, fit or predict?)
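A sketch of the wrapped-model option, assuming assignment probabilities that are known by design; the class and its usage are hypothetical and not part of the metalearners API:

import numpy as np

class FixedPropensityModel:
    """Return the same, user-supplied assignment probabilities for every unit."""

    def __init__(self, variant_probabilities):
        self.variant_probabilities = np.asarray(variant_probabilities)

    def fit(self, X, y):
        # Nothing to learn: the probabilities are fixed a priori.
        return self

    def predict_proba(self, X):
        # Broadcast the fixed per-variant probabilities to all rows of X.
        return np.tile(self.variant_probabilities, (len(X), 1))

# Hypothetical usage: propensity_model_factory=FixedPropensityModel together with
# propensity scores such as [0.5, 0.5] for a balanced two-variant experiment.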

Sklearn Dependency update to 1.4 instead of 1.3

In various places the sklearn function root_mean_squared_error is imported and used (e.g. here). The function was added in version 1.4, hence the sklearn requirement in pyproject.toml should be updated from 1.3 to 1.4 to reflect this dependency.

I would have liked to make a PR myself, but it was too much effort to figure out how to get the pixi pre-commit setup running without documentation (which, to my understanding, is already work in progress :))

Implement `predict_conditional_average_outcomes` for `RLearner`?

All implemented MetaLearners allow the user to call predict_conditional_average_outcomes. At the beginning we thought this was not possible for the RLearner, but I think the following formulas may work:

(For ease of notation I'll use $Y(k) := \mathbb{E}[Y_i(k)]$, $Y := \mathbb{E}[Y \mid X]$, $\tau(k) := \mathbb{E}[Y(k) - Y(0) \mid X]$ and $e(k) := \mathbb{P}[W = k \mid X]$)

We know this system of $K+1$ linear equations is true:

$$\begin{cases} Y(1) - Y(0) = \tau(1)\\\ Y(2) - Y(0) = \tau(2)\\\ \vdots \\\ Y(K) - Y(0) = \tau(K) \\\ e(0) Y(0) + e(1) Y(1) + \dots + e(K) Y(K) = Y \end{cases}$$

that we need to solve for $Y(0), Y(1), \dots, Y(K)$.

Isolating $Y(1), Y(2), \dots, Y(K)$ from each of the first $K$ equations and plugging them into the last one, we get:

$$e(0) Y(0) + e(1) (\tau(1) + Y(0)) + \dots + e(K) (\tau(K) + Y(0)) = Y$$

From this we can isolate $Y(0)$ as:

$$Y(0) = \frac{Y - \sum\limits_{i=1}^{K}e(i)\tau(i)}{e(0) + \sum\limits_{i=1}^{K} e(i)} = Y - \sum\limits_{i=1}^{K}e(i)\tau(i)$$

Where we used the fact that all the propensity scores should sum up to 1.

Finally we can compute all $Y(k)$ as $Y(k) = Y(0) + \tau(k)$.
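A small numerical sanity check of the formulas above, with illustrative values only:

import numpy as np

tau = np.array([1.5, -0.3])      # tau(1), ..., tau(K) with K = 2
e = np.array([0.5, 0.3, 0.2])    # e(0), ..., e(K), summing to 1
y_overall = 2.0                  # E[Y | X]

y0 = y_overall - np.sum(e[1:] * tau)         # Y(0) = Y - sum_k e(k) tau(k)
y = np.concatenate([[y0], y0 + tau])         # Y(k) = Y(0) + tau(k)

assert np.isclose(np.sum(e * y), y_overall)  # the propensity-weighted outcomes recover Y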

I extracted this idea for the binary case from a code snippet in this reference (screenshot omitted).

Any thoughts on this @kklein ?

MetaLearners to be implemented

MetaLearners are a family of approaches to estimate CATEs. This issue is supposed to track which concrete MetaLearners have already been implemented in this library.

| Name | Reference | Implemented? |
|---|---|---|
| T-Learner | https://arxiv.org/pdf/1706.03461 | Yes |
| S-Learner | https://arxiv.org/pdf/1706.03461 | Yes |
| R-Learner | https://arxiv.org/pdf/1712.04912 | Yes |
| X-Learner | https://arxiv.org/pdf/1706.03461 | Yes |
| DR-Learner | https://arxiv.org/pdf/2004.14497 | Yes |
| RA-Learner | https://arxiv.org/pdf/2101.10943 | |
| EP-Learner | https://arxiv.org/pdf/2402.01972 | |
| U-Learner | https://arxiv.org/pdf/1706.03461 | |
| F-Learner | https://arxiv.org/pdf/1706.03461 | |
| M-Learner (a.k.a. PW-Learner) | https://arxiv.org/pdf/2101.10943 | |

Please let us know if you'd like to use a MetaLearner -- whether already part of this list or not -- which is not yet implemented.

Implement an ensembler of `MetaLearner`s

sklearn provides a BaseEnsemble class which can be used to ensemble various Estimators.

Unfortunately, sklearn's BaseEnsemble does not work out of the box with a MetaLearner from metalearners due to differences in predict and fit signatures.

In order to facilitate the ensembling of CATE estimates from various MetaLearners, it would be useful to implement helpers.

Some open questions:

  • Should the ensemble be given trained MetaLearners or train the MetaLearners itself?
  • Should the ensemble require all MetaLearners to have been trained on exactly the same data?
  • Should the ensemble work with both in-sample and out-of-sample data?
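For illustration, a minimal averaging helper could look as follows, assuming the MetaLearners have already been fitted and share the same number of treatment variants (which answers the first open question in one particular way); the helper name is hypothetical:

import numpy as np

def average_cate_estimates(fitted_metalearners, X, is_oos):
    # Naive ensembling: average the CATE estimates of several fitted MetaLearners.
    estimates = [learner.predict(X, is_oos=is_oos) for learner in fitted_metalearners]
    return np.mean(estimates, axis=0)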

Allow for other prediction methods in `CrossFitEstimator`

Using a CrossFitEstimator, one can only call predict and predict_proba on the inner models for out-of-sample predictions. It may be interesting to allow passing a string so that other methods can be used. This could be useful, for instance, for survival models, where several kinds of predictions are sometimes possible, see here.

It may also be interesting to allow it in the metalearners, but this would be a second step.
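In plain Python, the idea boils down to dispatching on a method name; where exactly such a string would be accepted (e.g. in __init__ or per call) is part of this issue, and the names below are hypothetical:

def predict_with_method(model, X, method_name="predict"):
    # Look up an arbitrary prediction method by name, e.g. "predict_survival_function".
    return getattr(model, method_name)(X)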

Challenging "CATE estimation is not supervised learning"

This is not an issue or a bug, but there was no Discussions section, so I am asking here.

Let's start from the example in your docs:

(image from the docs omitted)

Why can't I train a multi-output (2 in this specific example) neural network as a regressor and mask the loss for the missing targets? Such masking is quite standard practice in all sorts of neural network use cases, e.g. when time series signals have different lengths etc.

So, here, CATE estimation exactly is supervised learning.
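For reference, a minimal sketch of the masked-loss idea in PyTorch (illustrative only; it does not settle whether such a model yields valid CATE estimates):

import torch

n, d, n_variants = 256, 5, 2
X = torch.randn(n, d)
w = torch.randint(0, n_variants, (n,))   # observed treatment assignment
y = torch.randn(n)                       # observed outcome

model = torch.nn.Sequential(
    torch.nn.Linear(d, 32), torch.nn.ReLU(), torch.nn.Linear(32, n_variants)
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
mask = torch.nn.functional.one_hot(w, num_classes=n_variants).float()

for _ in range(100):
    optimizer.zero_grad()
    predictions = model(X)                             # one head per potential outcome
    squared_errors = (predictions - y.unsqueeze(1)) ** 2
    loss = (squared_errors * mask).sum() / mask.sum()  # only observed outcomes contribute
    loss.backward()
    optimizer.step()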

Add support for polars

As of now, covariates X, treatment assignments w and observed outcomes y can be provided as numpy data structures (np.ndarray) or as pandas data structures (pd.DataFrame and pd.Series, respectively).

A PR to allow for X to be scipy.sparse.csr_matrix is in the making: PR #86

It might be beneficial to allow for polars datastructures, too.

One question that might arise is how we deal with a potential additional dependency. Do we want to wrap every polars-dependent piece of code in a try-block that tries to import? Do we want to make polars a run dependency of metalearners?
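For context, the optional-import option typically looks like this; it is a common pattern, not a decision for metalearners, and the helper name is hypothetical:

try:
    import polars as pl
except ImportError:  # polars stays an optional dependency
    pl = None

def _is_polars(obj) -> bool:
    # Only reference polars types if the import succeeded.
    return pl is not None and isinstance(obj, (pl.DataFrame, pl.Series))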

If you'd like to use metalearners with polars please let us know. :)

Evaluate method fails if feature_set is not None

  1. Initialize an X-, R-, or DR-Learner with a feature_set specifying which columns are used for which base models.
  2. Fit it.
  3. Evaluate it.

Evaluation fails with an error "Number of features must match the input...".

Remove git_root from run requirements?

Does it make sense to have git_root as a run requirement? If I install this package from PyPI or conda-forge, there's no guarantee that I'm running it inside a git repository. The only two helper functions inside the package that use git_root are

Here you can just add an argument that tells the functions where to download the data to. If you remove git_root there, it's just a development requirement afterwards.
