raamana / confounds

Conquering confounds and covariates: methods, library and guidance

Home Page: https://raamana.github.io/confounds

License: Apache License 2.0

Makefile 1.65% Python 98.35%

confound covariates machine-learning cross-validation scikit-learn statistics regression classification neuroimaging neuroscience

confounds's Issues

Error fitting Residualize

confounds version: 0.1.1
Python version: 3.9.7
Operating System: macOS 11.6

Description

I tried to run the example code with some dummy data, but I get an error when I try to fit and apply Residualize.

What I Did

# Using the diabetes dataset as an example
import numpy as np  # needed for np.arange below
from sklearn import datasets

df = datasets.load_diabetes(as_frame=True)['data']
X = df[['bmi', 'age', 's1']].values # some predictors
y = df['s6'].values # the outcome variable
c = df['sex'].values # a confound - does not matter which

# Splitting into a training and a test set
from sklearn.model_selection import train_test_split

train_ind, test_ind = train_test_split(np.arange(0, len(y)), test_size=0.2)
train_X = X[train_ind, :]
train_y = y[train_ind]
train_C = c[train_ind]

test_X = X[test_ind, :]
test_y = y[test_ind]
test_C = c[test_ind]

# Fitting Residualize to remove the confound
from confounds import Residualize

resid = Residualize()
resid.fit(train_X, train_C)
deconf_train_X = resid.transform(train_X, train_C)

Error message:

TypeError: check_is_fitted() takes from 1 to 2 positional arguments but 3 were given
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/var/folders/m0/mddm8pfx1vs3q52qvgx4mpxw0000gp/T/ipykernel_27134/3595471338.py in <module>
      1 resid = Residualize()
      2 resid.fit(train_X, train_C)
----> 3 deconf_train_X = resid.transform(train_X, train_C)

/opt/anaconda3/envs/brain_shadows/lib/python3.9/site-packages/confounds/base.py in transform(self, X, y)
    186         """Placeholder to pass sklearn conventions"""
    187 
--> 188         return self._transform(X, y)
    189 
    190 

/opt/anaconda3/envs/brain_shadows/lib/python3.9/site-packages/confounds/base.py in _transform(self, test_features, test_confounds)
    192         """Actual deconfounding of the test features"""
    193 
--> 194         check_is_fitted(self, 'model_', 'n_features_')
    195         test_features = check_array(test_features, accept_sparse=True)
    196 

TypeError: check_is_fitted() takes from 1 to 2 positional arguments but 3 were given

Comment

It looks like there is some incompatibility, but I'm not sure what package is causing the error. Any help would be greatly appreciated!
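The traceback appears consistent with how recent scikit-learn releases define `check_is_fitted`: only the estimator and `attributes` may be passed positionally, while `msg` and `all_or_any` are keyword-only. The sketch below uses a stand-in function, `check_is_fitted_like`, which is hypothetical and only mimics that signature shape (it is not scikit-learn code). It illustrates why two attribute names passed as separate positional arguments fail, while a single list of names works.

```python
# Hypothetical stand-in mimicking the signature shape
# check_is_fitted(estimator, attributes=None, *, msg=None); NOT scikit-learn code.
def check_is_fitted_like(estimator, attributes=None, *, msg=None):
    """Raise AttributeError if any requested attribute is missing."""
    if attributes is None:
        attributes = []
    if isinstance(attributes, str):
        attributes = [attributes]
    missing = [name for name in attributes if not hasattr(estimator, name)]
    if missing:
        raise AttributeError(msg or f"not fitted, missing: {missing}")


class FittedThing:
    """Toy object carrying the two attributes named in the traceback."""

    def __init__(self):
        self.model_ = object()
        self.n_features_ = 3


thing = FittedThing()

# Old-style call: attribute names as separate positional arguments.
try:
    check_is_fitted_like(thing, 'model_', 'n_features_')
    old_style_error = None
except TypeError as exc:
    old_style_error = str(exc)

# Passing the names as one list stays within two positional arguments.
check_is_fitted_like(thing, ['model_', 'n_features_'])
list_style_ok = True
```

If this is indeed the cause, the call in `confounds/base.py` would need the list form, or an older scikit-learn would need to be installed alongside this version of confounds.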

Add generator for simulated or real confounded datasets

better validation of inputs to Deconfounders

Issue #19 is a reminder that users can be confused because the code leaves the second argument to .fit() and .transform() optional (y=None). The only reason we have y=None is to try to follow sklearn conventions and pass their tests, but given that we can't pass them anyway, we should tighten this up and make it an error to not supply the second [necessary] input argument.

cc @jrasero @jameschapman19
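A minimal sketch of what such tightened validation might look like, using only NumPy. The helper name `validate_features_and_confounds` and its error messages are hypothetical and not part of the library; it simply refuses a missing second argument and checks that the row counts agree.

```python
import numpy as np


def validate_features_and_confounds(X, confounds):
    """Hypothetical helper: require confounds and check that shapes agree."""
    if confounds is None:
        raise ValueError("confounds must be supplied as the second argument; got None")

    X = np.asarray(X, dtype=float)
    confounds = np.asarray(confounds, dtype=float)

    if X.ndim != 2:
        raise ValueError(f"X must be 2-D (samples x features); got ndim={X.ndim}")
    if confounds.ndim == 1:
        # Treat a 1-D vector as a single confound column.
        confounds = confounds.reshape(-1, 1)
    if X.shape[0] != confounds.shape[0]:
        raise ValueError(
            f"X has {X.shape[0]} samples but confounds has {confounds.shape[0]}"
        )
    return X, confounds


# Example: two samples, two features, one confound.
X_ok, C_ok = validate_features_and_confounds([[1.0, 2.0], [3.0, 4.0]], [0.0, 1.0])
```

Calling the helper at the top of both `.fit()` and `.transform()` would turn a silently accepted `y=None` into an immediate, descriptive error.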

drop-in replacements for cross_val_predict and cross_val_score etc

Pradeep,

could something like this be of interest for the library?

The idea would be to create a class that would do fit and predict including deconfounding and the use of the estimator in an encapsulated way.

Below is a skeleton example. This would only deconfound the input data.

cross_val_predict and cross_val_score functions could as well be implemented.

from sklearn.base import clone

class SklearnWrapper():

    def __init__(self,
                 deconfounder,
                 estimator):

        self.deconfounder = deconfounder
        self.estimator = estimator

    def fit(self,
            input_data,
            target_data,
            confounders,
            sample_weight=None):

        # clone input arguments
        deconfounder = clone(self.deconfounder)
        estimator = clone(self.estimator)

        # Deconfound input data
        deconf_input = deconfounder.fit_transform(input_data, confounders)
        self.deconfounder_ = deconfounder

        # Fit deconfounded input data
        # pass sample_weight by keyword; not every estimator accepts it positionally
        estimator.fit(deconf_input, target_data, sample_weight=sample_weight)
        self.estimator_ = estimator

        return self

    def predict(self,
                input_data,
                confounders):

        deconf_input = self.deconfounder_.transform(input_data, confounders)

        return self.estimator_.predict(deconf_input)
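To illustrate what a cross_val_predict-style function would need to do, here is a rough NumPy-only sketch. It is not the library's API: plain least-squares residualization stands in for the deconfounder, an ordinary least-squares fit stands in for the estimator, and all function names are hypothetical. The key point it tries to show is that the deconfounding model is fit on the training fold only and then applied to the held-out fold.

```python
import numpy as np


def residualize_fit(X_train, C_train):
    """Least-squares coefficients mapping confounds (plus intercept) to features."""
    design = np.column_stack([np.ones(len(C_train)), C_train])
    coef, *_ = np.linalg.lstsq(design, X_train, rcond=None)
    return coef


def residualize_apply(X, C, coef):
    """Subtract the confound-predicted part of the features."""
    design = np.column_stack([np.ones(len(C)), C])
    return X - design @ coef


def deconfounded_cross_val_predict(X, y, C, n_folds=5, seed=0):
    """Out-of-fold predictions with deconfounding fit inside each training fold."""
    X, y, C = np.asarray(X, float), np.asarray(y, float), np.asarray(C, float)
    order = np.random.default_rng(seed).permutation(len(y))
    predictions = np.empty(len(y), dtype=float)

    for test_idx in np.array_split(order, n_folds):
        train_idx = np.setdiff1d(order, test_idx)

        # Fit the deconfounder on training data only, apply to both parts.
        coef = residualize_fit(X[train_idx], C[train_idx])
        X_train = residualize_apply(X[train_idx], C[train_idx], coef)
        X_test = residualize_apply(X[test_idx], C[test_idx], coef)

        # OLS stand-in for the wrapped estimator.
        A_train = np.column_stack([np.ones(len(train_idx)), X_train])
        beta, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)
        A_test = np.column_stack([np.ones(len(test_idx)), X_test])
        predictions[test_idx] = A_test @ beta

    return predictions


# Synthetic example: features and target both depend on one confound.
rng = np.random.default_rng(42)
C = rng.normal(size=(100, 1))
X = rng.normal(size=(100, 3)) + 2.0 * C
y = X[:, 0] - C[:, 0] + 0.1 * rng.normal(size=100)
preds = deconfounded_cross_val_predict(X, y, C)
```

A cross_val_score analogue could reuse the same loop and compute a score per fold instead of collecting predictions.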

Implement metrics to quantify confound to target relationships

Some ideas: correlation, R^2, delta R^2, etc.
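As a rough sketch of how these metrics could be computed with NumPy alone: the function names below are hypothetical, and "delta R^2" is read here as the gain in R^2 when the confound is added to the features in a linear model of the target, which is only one possible interpretation.

```python
import numpy as np


def r_squared(design, target):
    """R^2 of an ordinary least-squares fit of target on design (intercept added)."""
    A = np.column_stack([np.ones(len(target)), design])
    beta, *_ = np.linalg.lstsq(A, target, rcond=None)
    residual = target - A @ beta
    centered = target - target.mean()
    return 1.0 - (residual @ residual) / (centered @ centered)


def confound_target_metrics(X, y, c):
    """Correlation, R^2 and delta R^2 between a single confound c and target y."""
    r2_features = r_squared(X, y)
    r2_features_and_confound = r_squared(np.column_stack([X, c]), y)
    return {
        "correlation": float(np.corrcoef(c, y)[0, 1]),
        "r2_confound": float(r_squared(c, y)),
        "delta_r2": float(r2_features_and_confound - r2_features),
    }


# Synthetic example in which the target depends on both features and confound.
rng = np.random.default_rng(0)
c = rng.normal(size=200)
X = rng.normal(size=(200, 2))
y = X[:, 0] + 0.5 * c + rng.normal(size=200)
metrics = confound_target_metrics(X, y, c)
```

For a single confound, `r2_confound` equals the squared correlation, so the two are redundant there; they differ once several confounds are regressed jointly.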

Add tutorial notebooks, with few example use-cases

the usage can be easily turned into a tutorial notebook: https://raamana.github.io/confounds/usage.html

we can add more depending on the utilities and helpers etc

Including causal discovery methods such as LiNGAM

Should we consider offering causal discovery based on LiNGAM?
For example, Yang and Suzuki [2] apply LiNGAM to recognize brain connectivity patterns in fMRI data.

References

[1] Python package for causal discovery based on LiNGAM: https://www.jmlr.org/papers/v24/21-0321.html
[2] Yang and Suzuki, The Functional LiNGAM

Adding more documentation

References to

literature on confounding in general and types of bias that are not due to confounding
Recent methods in the functional genomics literature such as RUV-4 and scMerge
Papers evaluating harmonization methods and whether or not they succeed at de-confounding.
https://pubmed.ncbi.nlm.nih.gov/22101192/
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4679071/

Add tests for DummyDeconfounding() method

Methods and metrics to quantify the level of confounding in a given sample

This may be more than a good first issue, but depending on skill level, it can be one.

Performance score stratified by confound

utils.score_stratified_by_confound()

Helper to summarize the performance score (accuracy, MSE, MAE etc) for each
level or variant of confound. This is helpful to assess any bias towards a
particular value when confounds are categorical (such as site or gender). So
if the MSE (of target) for Females is much lower compared to Males, then it
may indicate a potential bias of the model towards Females (due to imbalance in
size?)
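A minimal sketch of what such a helper might do, assuming a categorical confound and a scoring function that takes `(y_true, y_pred)`. The names `score_by_confound_level` and `mean_squared_error` below are illustrative stand-ins, not the library's implementation.

```python
import numpy as np


def mean_squared_error(y_true, y_pred):
    """Plain MSE, used here as an example score function."""
    return float(np.mean((y_true - y_pred) ** 2))


def score_by_confound_level(y_true, y_pred, confound, score_func):
    """Compute score_func separately for each distinct value of the confound."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    confound = np.asarray(confound)

    scores = {}
    for level in np.unique(confound):
        mask = confound == level
        scores[level.item()] = score_func(y_true[mask], y_pred[mask])
    return scores


# Toy example: predictions are perfect for 'F' and off by 2 for 'M'.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 5.0, 2.0]
sex = ['F', 'F', 'M', 'M']
scores = score_by_confound_level(y_true, y_pred, sex, mean_squared_error)
```

Reporting the group sizes alongside each score would help tell a genuine bias apart from an artifact of imbalance.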

Implement Propensity Score estimation

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

confounds version: 0.1.3
Python version: 3.10
Operating System: mac

I get this error even though the data is clean and has no NA or NaN values.

C_sample = graph_corr_1[confound_cols]
X_sample = graph_corr_1.drop(confound_cols, axis=1)

resid = Residualize()
resid.fit(C_sample)
graph_corr_2 = resid.transform(C_sample)
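One way to double-check the inputs is to count non-finite entries per column, since a NaN-only check would miss infinite values. The helper `report_non_finite` below is a hypothetical NumPy sketch, not part of confounds. Separately, the calls above pass a single argument, whereas the earlier example in this list passes both features and confounds to .fit() and .transform(); that difference may be worth ruling out as well.

```python
import numpy as np


def report_non_finite(array, name="array"):
    """Return {column label: count} for columns containing NaN or +/-inf."""
    values = np.asarray(array, dtype=float)
    if values.ndim == 1:
        values = values.reshape(-1, 1)
    counts = (~np.isfinite(values)).sum(axis=0)
    return {f"{name}[:, {j}]": int(n) for j, n in enumerate(counts) if n > 0}


# Toy example: one NaN in column 0 and one inf in column 1.
data = np.array([[1.0, np.inf],
                 [np.nan, 2.0],
                 [3.0, 4.0]])
problems = report_non_finite(data, name="data")
clean = report_non_finite(np.ones((3, 2)), name="ones")
```

An empty dictionary means every entry is finite, in which case the error is more likely to come from how the arguments are passed than from the data itself.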