thomasbury / arfs
All Relevant Feature Selection
License: MIT License
ModuleNotFoundError: No module named 'feature_selection'
arfs/src/arfs/feature_selection/unsupervised.py
Lines 302 to 310 in bbcc785
On L308 the default is weighted_corr, but weighted_corr defaults to 'pearson', not 'spearman' as the documentation claims. I also double-checked it against scipy.stats.spearmanr and the values are not equal.
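A quick sketch of the check I ran (synthetic data, my own snippet rather than the library's):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = x + rng.normal(size=200)

# Pearson and Spearman generally differ; if the default were really
# 'spearman', the selector's values should match spearmanr, not pearsonr.
print(stats.pearsonr(x, y)[0])
print(stats.spearmanr(x, y)[0])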
LightForestRegressor and LightForestClassifier
I can't wait to try this package on my own projects.
Hello, I think I found a possible bug in src/arfs/feature_selection/allrelevant.py, in the function _reduce_vars_sklearn:
# Get mean value from the shadows (max, like in Boruta, median to mitigate variance)
mean_shadow = (
    shadow_vars.select_dtypes(include=[np.number])
    .max(skipna=True)
    .mean(skipna=True)
    / cutoff
)
The bug is about max, which should have axis=1. Without it, it takes the maximum SHAP value across all features for each iteration, but I would expect the contrary: the maximum SHAP value across all iterations for each feature.
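A toy frame makes the difference concrete (hypothetical data; rows are shadow features, columns are per-iteration scores, as in the dataframe built by the function):

import pandas as pd

shadow_vars = pd.DataFrame(
    {"fscore1": [0.1, 0.5], "fscore2": [0.9, 0.2]},
    index=["shadow_a", "shadow_b"],
)
print(shadow_vars.max())        # default axis=0: max across features, one value per iteration
print(shadow_vars.max(axis=1))  # axis=1: max across iterations, one value per feature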
I may be wrong here, but I had to ask. Thank you!
Side note: the comment above the function is probably outdated :)
When using Leshy, it works normally, but it can't render the feature-importance plots.
Hi, thanks for writing such a great module!
After running CollinearityThreshold.fit_transform() on some of my data, I tried to find out which unselected features are collinear with my selected features. Looking at the assoc_matrix_, 502/643 features had no values above my threshold, in contrast to only 231/643 features being selected. Spot-checking some of the unselected features showed that they also never had a value above the threshold. This led me to the code for dropping features, and I am a little confused by it.
arfs/src/arfs/feature_selection/unsupervised.py
Lines 426 to 463 in bbcc785
In lines 438-444, it looks like you are trying to sum the row and column for a given feature and return the feature with the highest average, correct? However, association_matrix[to_drop] == association_matrix.loc[:, to_drop], so in L439 your index would be all features instead of just the features in to_drop.
Second, in L439 and L442 you sort the series, but I believe you should just do a final sort in L444 or L445, because pandas aligns on the index when adding:
>>> series = pd.Series([1, 0], index=['A', 'B'])
>>> series += pd.Series([1.4, 0.1], index=['B', 'A'])
>>> print(series)
A 1.1
B 1.4
dtype: float64
Combined, these two things seem to result in the incorrect feature being dropped.
Related, but not a bug: I found that changing L427-L436 to the following gave a huge speedup (998 ms ± 26.7 ms per loop vs 27.3 s ± 558 ms per loop) in calling _recursive_collinear_elimination for me (n_features=643).
cols_to_drop = association_matrix.loc[
    :, (association_matrix.abs() > threshold).any(axis=0)
].columns.values
rows_to_drop = association_matrix.loc[
    (association_matrix.abs() > threshold).any(axis=1), :
].index.values
Hi, I wonder whether it is necessary to limit the tree depth, as in Boruta, and how much this parameter (and in general all the LGBM settings: leaves, boosting rounds, ...) could impact BoostAGroota. Is it worth covering in the docs?
Thanks, great project.
The CollinearityThreshold class is intended to remove collinear features from datasets. However, it appears to be dropping features that do not meet the specified collinearity threshold, leading to the potential loss of important data. An example of this issue is the unwarranted removal of the 'age' column in the Titanic dataset provided in the examples, where the association values are below the set threshold.
The class should only remove features that are collinear above the specified threshold; features with association values below this threshold should be retained in the dataset.
The class is removing features that do not exceed the collinearity threshold. This behavior is observed in the recursive feature elimination process, where features are dropped inappropriately.
To reproduce, instantiate CollinearityThreshold with a specific threshold. The fix is to modify the _recursive_collinear_elimination method so that it removes only those features that exceed the specified collinearity threshold. The proposed change adds a condition that breaks the while loop when no more features exceed the threshold, preventing the unnecessary removal of features.
Old Version
def _recursive_collinear_elimination(association_matrix, threshold):
    dum = association_matrix.copy()
    most_collinear_features = []

    most_collinear_feature, to_drop = _most_collinear(association_matrix, threshold)
    most_collinear_features.append(most_collinear_feature)
    dum = dum.drop(columns=most_collinear_feature, index=most_collinear_feature)

    while len(to_drop) > 1:
        most_collinear_feature, to_drop = _most_collinear(dum, threshold)
        most_collinear_features.append(most_collinear_feature)
        dum = dum.drop(columns=most_collinear_feature, index=most_collinear_feature)
    return most_collinear_features
New Version
def _recursive_collinear_elimination(association_matrix, threshold):
    dum = association_matrix.copy()
    most_collinear_features = []

    while True:
        most_collinear_feature, to_drop = _most_collinear(dum, threshold)
        # Break if no more features to drop
        if not to_drop:
            break
        if most_collinear_feature not in most_collinear_features:
            most_collinear_features.append(most_collinear_feature)
        dum = dum.drop(columns=most_collinear_feature, index=most_collinear_feature)
    return most_collinear_features
Prevent passing a single categorical column to the nom-nom association (Theil's U and Cramér's V matrices).
Hello, I was analyzing the source code of BoostAGroota and noticed a couple of strange things. I will show the first here and the other in a separate issue, to keep things tidy.
In src/arfs/feature_selection/allrelevant.py, in the function _reduce_vars_sklearn, there is this piece of code:
if i == 1:
    df = pd.DataFrame({"feature": new_x.columns})

df2 = df.copy()
# Store the feature importances in df2
try:
    # Normalize the feature importances
    df2["fscore" + str(i)] = importance / importance.sum()
except ValueError:
    print("Only Sklearn tree based methods allowed")
# Merge the current feature importances with the existing ones in df
df = pd.merge(
    df, df2, on="feature", how="outer", suffixes=("_x" + str(i), "_y" + str(i))
)
The issue is the pd.merge: over n_iterations, it duplicates the feature importances of previous iterations, because the same importance columns are present in both df and df2.
After n_iterations, I expect an importance dataframe of shape (len(real_vars) + len(shadow_vars)) x n_iterations, but the number of columns is far higher because of the pd.merge. As a result, the average importances computed with df["Mean"] = df.select_dtypes(include=[np.number]).mean(axis=1, skipna=True) will be slightly biased by the repeated columns.
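A minimal sketch of the effect (toy frames, not the library's data): merging two frames that share the fscore1 column duplicates it under both suffixes.

import pandas as pd

df = pd.DataFrame({"feature": ["a", "b"], "fscore1": [0.6, 0.4]})
df2 = df.copy()
df2["fscore2"] = [0.7, 0.3]
merged = pd.merge(df, df2, on="feature", how="outer", suffixes=("_x2", "_y2"))
print(merged.columns.tolist())  # ['feature', 'fscore1_x2', 'fscore1_y2', 'fscore2']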
You can check this behavior by comparing df and df2 right after the for loop: df has far too many columns, while df2 has the correct number.
I hope I have described the problem well enough. Let me know what you think.
Hello, I've noticed that if you use set_config(transform_output="pandas"), the BoostAGroota.transform method works incorrectly: it shuffles the columns of the resulting pandas DataFrame (the features left after selection).
Here is a code snippet to reproduce the problem.
import warnings
warnings.filterwarnings('ignore')
from sklearn import set_config
from lightgbm import LGBMRegressor
import arfs.feature_selection.allrelevant as arfsgroot
from arfs.utils import load_data
set_config(transform_output='pandas')
boston = load_data(name="Boston")
X, y = boston.data, boston.target
fs = arfsgroot.BoostAGroota(LGBMRegressor(n_estimators=1, random_state=42))
X_transformed = fs.fit_transform(X, y)
print(X)
print(X_transformed)
As you can see, the column CRIM contains values that were originally in the column AGE.
Requirements used in the code:
arfs==1.0.7
bleach==6.0.0
bokeh==2.4.3
certifi==2022.12.7
charset-normalizer==3.1.0
cloudpickle==2.2.1
colorcet==3.0.1
contourpy==1.0.7
cycler==0.11.0
fonttools==4.39.3
holoviews==1.15.4
idna==3.4
importlib-metadata==6.6.0
importlib-resources==5.12.0
Jinja2==3.1.2
joblib==1.2.0
kiwisolver==1.4.4
lightgbm==3.3.3
llvmlite==0.40.0
Markdown==3.4.3
MarkupSafe==2.1.2
matplotlib==3.7.1
numba==0.57.0
numpy==1.21.6
packaging==23.1
pandas==1.5.1
panel==0.14.4
param==1.13.0
Pillow==9.5.0
pyct==0.5.0
pyparsing==3.0.9
python-dateutil==2.8.2
pytz==2023.3
pyviz-comms==2.2.1
PyYAML==6.0
requests==2.30.0
scikit-learn==1.2.0
scipy==1.8.1
seaborn==0.12.2
shap==0.41.0
six==1.16.0
slicer==0.0.7
threadpoolctl==3.1.0
tornado==6.3.1
tqdm==4.65.0
typing_extensions==4.5.0
tzdata==2023.3
urllib3==2.0.1
webencodings==0.5.1
zipp==3.15.0
I also tried to understand why this happens and figured out that the behavior is caused by the implementation of the transform method.
As you can see below, using return X[self.selected_features_] behaves strangely:
import pandas as pd
import numpy as np
from sklearn import set_config
from sklearn.feature_selection._base import SelectorMixin
from sklearn.base import BaseEstimator

set_config(transform_output="pandas")


class FeatureSelector_with_shuffled_output(SelectorMixin, BaseEstimator):
    def fit(self, X, y):
        self.feature_names_in_ = X.columns.to_numpy()
        random = np.random.RandomState(44)
        self.selected_features_ = random.choice(X.columns, X.shape[1] // 2, replace=False)
        self.support_ = np.array([c in self.selected_features_ for c in X.columns])
        return self

    def _get_support_mask(self):
        return self.support_

    def transform(self, X):
        if not isinstance(X, pd.DataFrame):
            raise ValueError("X needs to be pandas.DataFrame")
        return X[self.selected_features_]


class FeatureSelector(SelectorMixin, BaseEstimator):
    def fit(self, X, y):
        self.feature_names_in_ = X.columns.to_numpy()
        random = np.random.RandomState(44)
        self.selected_features_ = random.choice(X.columns, X.shape[1] // 2, replace=False)
        self.support_ = np.array([c in self.selected_features_ for c in X.columns])
        return self

    def _get_support_mask(self):
        return self.support_


X = pd.DataFrame({
    "a": np.random.randint(50, 100, 10),
    "b": np.random.randint(10, 20, 10),
    "c": np.random.randint(-100, -50, 10),
    "d": np.random.randint(-10, 0, 10),
})
y = pd.Series(np.random.rand(10))

fs = FeatureSelector()
print(fs.fit_transform(X, y))
fsw = FeatureSelector_with_shuffled_output()
print(fsw.fit_transform(X, y))
print(X)
Hope this is helpful! If you have any questions, I'm open to discussion or to adding more information.
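For what it's worth, a possible fix sketch (my assumption, not a confirmed patch): index with the boolean support mask so the values stay in the original column order, which is the order of the feature names that sklearn's pandas output wrapper re-applies.

def transform(self, X):
    if not isinstance(X, pd.DataFrame):
        raise ValueError("X needs to be pandas.DataFrame")
    # Boolean-mask selection preserves the original column order, so the
    # wrapped pandas output gets matching names and values.
    return X.loc[:, self.support_]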
Title: Custom callable/function for the CollinearityThreshold class (nom_nom_assoc | num_num_assoc | nom_num_assoc)
Body:
Description of the Issue:
I encountered an error while trying to implement a custom callable for the CollinearityThreshold class, specifically when integrating the Predictive Power Score (PPS). The code describes the expected callable as follows: "If callable, a function which receives two pd.Series (and optionally a weight array) and returns a single number."
Code Sample:
I've implemented the PPS as follows:
import ppscore as pps

def ppscore_arfs(x, y, **kwargs):
    """
    Calculate the Predictive Power Score (PPS) for series x with respect to series y.

    Parameters:
        x (pandas.Series): A series representing a feature.
        y (pandas.Series): A series representing a feature.
        **kwargs: Additional keyword arguments for the ppscore function.

    Returns:
        float: A score representing the PPS between x and y.
    """
    # Merge x and y into a single DataFrame
    df = pd.concat([x, y], axis=1)
    # Calculate the PPS and extract the score
    score = float(pps.score(df, df.columns[0], df.columns[1])["ppscore"])
    return score
I then applied this function in the CollinearityThreshold class as follows:
selector = CollinearityThreshold(
    method="association",
    nom_nom_assoc=ppscore_arfs,
    num_num_assoc=ppscore_arfs,
    nom_num_assoc=ppscore_arfs,
    threshold=0.85,
).fit(X, y)
Error Encountered:
Upon executing the above, I received the following error message:
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
i:\Coding\00_Playground\arfs\test_arfs.ipynb, cell 20, line 7
      1 selector = CollinearityThreshold(
      2     method="association",
      3     nom_nom_assoc=ppscore_arfs,
      4     num_num_assoc=ppscore_arfs,
      5     nom_num_assoc=ppscore_arfs,
      6     threshold=0.85,
----> 7 ).fit(X, y)

File arfs\feature_selection\unsupervised.py:349, in CollinearityThreshold.fit(self, X, y, sample_weight)
    346 X = encoder.fit_transform(X)
    347 del encoder
--> 349 assoc_matrix = association_matrix(
    350     X=X,
    351     sample_weight=sample_weight,
    352     n_jobs=self.n_jobs,
    353     nom_nom_assoc=self.nom_nom_assoc,
    354     num_num_assoc=self.num_num_assoc,
    355     nom_num_assoc=self.nom_num_assoc,
    356 )
    357 self.assoc_matrix_ = xy_to_matrix(assoc_matrix)
    359 to_drop = _recursive_collinear_elimination(self.assoc_matrix_, self.threshold)

File arfs\association.py:1227, in association_matrix(X, sample_weight, nom_nom_assoc, num_num_assoc, nom_num_assoc, n_jobs, handle_na, nom_nom_comb, num_num_comb, nom_num_comb)
   1225 if n_num_cols >= 2:
   1226     if callable(num_num_assoc):
-> 1227         w_num_num = _callable_association_matrix_fn(
   1228             assoc_fn=num_num_assoc,
   1229             cols_comb=num_num_comb,
   1230             kind="num-num",
   1231             X=X,
   1232             sample_weight=sample_weight,
   1233             n_jobs=n_jobs,
   1234         )
   1235     else:
   1236         w_num_num = wcorr_matrix(
   1237             X, sample_weight, n_jobs, handle_na=None, method=num_num_assoc
   1238         )

File arfs\association.py:1426, in _callable_association_matrix_fn(assoc_fn, X, sample_weight, n_jobs, kind, cols_comb)
   1424 cols_comb = [comb for comb in combinations(selected_cols, 2)]
   1425 _assoc_fn = partial(_compute_matrix_entries, func_xyw=assoc_fn)
-> 1426 assoc = parallel_matrix_entries(
   1427     func=_assoc_fn,
   1428     df=X,
   1429     comb_list=cols_comb,
   1430     sample_weight=sample_weight,
   1431     n_jobs=n_jobs,
   1432 )
   1434 else:
   1435     assoc = None

File arfs\parallel.py:55, in parallel_matrix_entries(func, df, comb_list, sample_weight, n_jobs)
    50 lst = Parallel(n_jobs=n_jobs)(
    51     delayed(func)(X=df, sample_weight=sample_weight, comb_list=comb_chunk)
    52     for comb_chunk in comb_chunks
    53 )
    54 # return flatten list of pandas DF
---> 55 return pd.concat(list(chain(*lst)), ignore_index=True)

File pandas\util\_decorators.py:331, in deprecate_nonkeyword_arguments.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
--> 331 return func(*args, **kwargs)

File pandas\core\reshape\concat.py:368, in concat(objs, axis, join, ignore_index, keys, levels, names, verify_integrity, sort, copy)
--> 368 op = _Concatenator(

File pandas\core\reshape\concat.py:458, in _Concatenator.__init__(self, objs, axis, join, keys, levels, names, ignore_index, verify_integrity, copy, sort)
    453 if not isinstance(obj, (ABCSeries, ABCDataFrame)):
    454     msg = (
    455         f"cannot concatenate object of type '{type(obj)}'; "
    456         "only Series and DataFrame objs are valid"
    457     )
--> 458     raise TypeError(msg)

TypeError: cannot concatenate object of type '<class 'float'>'; only Series and DataFrame objs are valid
Request for Assistance:
I am seeking guidance on how to resolve this error. It seems related to the way the ppscore_arfs function is implemented, or to how it is integrated with the CollinearityThreshold class. Any insights or suggestions on how to correctly implement this custom callable would be greatly appreciated.
Thank you in advance for your assistance!
It looks like it's impossible to pass a model instance to GrootCV, while it's possible for the other classes.
Is that intentional? We'd be happy to contribute a PR.
The charts for Leshy, BoostAGroota and GrootCV are not consistent. Making them render the same information would simplify interpretation.
Thank you for this wonderful package. It must have taken a lot of research and hard work to address so many issues with the older packages! I don't have the expertise to send you a PR, but I did note this:
LightGBM has a class_weight parameter for unbalanced classes that seems to be missing in GrootCV. One can set the objective to multiclass, but there is then no way to pass the corresponding class_weight parameter, so LightGBM warns that the classes are unbalanced.
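For reference, this is the plain LightGBM sklearn-API parameter I mean (a minimal sketch with made-up data, not GrootCV's API):

from lightgbm import LGBMClassifier
import numpy as np

X = np.random.rand(200, 4)
y = np.random.choice([0, 1, 2], size=200, p=[0.8, 0.1, 0.1])  # unbalanced classes
clf = LGBMClassifier(objective="multiclass", class_weight="balanced").fit(X, y)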
Please update the pandas dependency to >= 1.4.0 so that on Apple silicon we can use wheels for Python >= 3.8.
I keep getting the warning below when using the pipeline filtering methods, and it cannot be suppressed using np.seterr() or a warnings filter:
/lib/python3.10/site-packages/arfs/association.py:715: RuntimeWarning: invalid value encountered in scalar divide
  return wcov(x, y, w) / np.sqrt(wcov(x, x, w) * wcov(y, y, w))
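A sketch of why caller-side suppression can fail and where a guard could go (my assumption: the association is computed in joblib worker processes, which do not inherit the parent's np.seterr state; the wcov below is a minimal stand-in, not the library's implementation):

import numpy as np

def wcov(x, y, w):
    # minimal weighted covariance stand-in
    mx, my = np.average(x, weights=w), np.average(y, weights=w)
    return np.average((x - mx) * (y - my), weights=w)

def wcorr(x, y, w):
    # Silencing at the source works even inside worker processes.
    with np.errstate(invalid="ignore", divide="ignore"):
        return wcov(x, y, w) / np.sqrt(wcov(x, x, w) * wcov(y, y, w))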
Hello Thomas,
Thanks for a great library!
I am getting good results with the GrootCV procedure and was wondering whether you have any published work on this algorithm.
I am interested in using it in a study and it would be great to have something to cite :)
Otherwise, how would you like the library to be cited?
Hi Thomas. Thank you for open-sourcing this library, it's very useful.
I found your idea of using Isolation Forest for downsampling the observations passed to SHAP very interesting. I'm wondering whether you also tried HDBSCAN clustering. It is used, for example, as the default downsampling method in the interpret_community library, and it has the advantage of automatically selecting the optimal number of clusters (i.e., the optimal number of samples to best represent the dataset). There is a Python implementation of HDBSCAN in this package: https://hdbscan.readthedocs.io/en/latest/.
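Something like the following is what I have in mind (a rough sketch using the hdbscan package; keeping one representative per cluster is my own simplification):

import numpy as np
import hdbscan

X = np.random.rand(500, 5)
labels = hdbscan.HDBSCAN(min_cluster_size=10).fit(X).labels_
# Keep one representative observation per cluster (noise points are labelled -1).
reps = np.array([X[labels == k].mean(axis=0) for k in set(labels) if k != -1])
print(reps.shape)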
I'd like to use GrootCV not only for feature selection but also for data mining. Specifically, what I'm lacking is a way to see how each feature influenced the prediction for each sample over the iterations. I have been using SHAP for that purpose until now, but the estimates can be a little noisy, or at least I fear so.
Similarly to how I can get the full history of overall importance per feature via GrootCV.cv_df, would it be possible to also surface the impact per feature, per sample, per iteration? That way I could build statistical confidence about each feature's impact on each sample.
When using feat_selector = arfsrel.GrootCV(objective='multiclass', cutoff=1, n_folds=5, n_iter=5, silent=False), I get:
LightGBMError: Number of classes should be specified and greater than 1 for multiclass training
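For context, the error comes from LightGBM itself: multiclass objectives require num_class. A bare LightGBM sketch (my own illustration with random data, not GrootCV's API):

import lightgbm as lgb
import numpy as np

X = np.random.rand(100, 4)
y = np.random.randint(0, 3, size=100)
params = {"objective": "multiclass", "num_class": 3, "verbosity": -1}
booster = lgb.train(params, lgb.Dataset(X, label=y), num_boost_round=5)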
Thank you for this package!
Hey, I was looking through the code of CollinearityThreshold for a similar method I've been putting together, and I noticed that in _recursive_collinear_elimination the user-specified collinearity threshold is not propagated into the while loop. I've highlighted the issue with a comment below:
def _recursive_collinear_elimination(association_matrix, threshold):
    dum = association_matrix.copy()
    most_collinear_features = []

    most_collinear_feature, to_drop = _most_collinear(association_matrix, threshold)
    most_collinear_features.append(most_collinear_feature)
    dum = dum.drop(columns=most_collinear_feature, index=most_collinear_feature)

    while len(to_drop) > 1:
        most_collinear_feature, to_drop = _most_collinear(dum, 0.75)  # Should be `threshold` instead of 0.75
        most_collinear_features.append(most_collinear_feature)
        dum = dum.drop(columns=most_collinear_feature, index=most_collinear_feature)
    return most_collinear_features
Thanks for this fantastic package by the way.
Hello, when I was using Leshy with a CatBoost estimator and a dataset that has categorical features, I noticed that all the features of my dataset were treated as categorical and passed to the cat_features parameter of CatBoost. This is caused by this line:
If you have X = np.array([[1, 2, 'a'], [3, 4, 'b']]) and you pass it to pd.DataFrame, the dtype of every column becomes object. I propose using the original method for creating shadow features, which keeps the original dtypes of the columns:
def _create_shadow(x_train):
    """
    Take all X variables, creating copies and randomly shuffling them

    :param x_train: the dataframe to create shadow features on
    :return: dataframe 2x width and the names of the shadows for removing later
    """
    x_shadow = x_train.copy()
    for c in x_shadow.columns:
        np.random.shuffle(x_shadow[c].values)
    # rename the shadow
    shadow_names = ["ShadowVar" + str(i + 1) for i in range(x_train.shape[1])]
    x_shadow.columns = shadow_names
    # Combine to make one new dataframe
    new_x = pd.concat([x_train, x_shadow], axis=1)
    return new_x, shadow_names
LightGBM changed the way it performs early stopping, logging, etc. by introducing new callbacks. ARFS should be updated to stick to the new lightgbm API.
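A minimal sketch of the callback-based API this refers to (random data; lightgbm >= 3.3):

import lightgbm as lgb
import numpy as np

X, y = np.random.rand(200, 5), np.random.rand(200)
train_set = lgb.Dataset(X[:150], label=y[:150])
valid_set = lgb.Dataset(X[150:], label=y[150:], reference=train_set)

booster = lgb.train(
    {"objective": "regression", "verbosity": -1},
    train_set,
    valid_sets=[valid_set],
    callbacks=[
        lgb.early_stopping(stopping_rounds=10),  # replaces early_stopping_rounds=...
        lgb.log_evaluation(period=0),            # replaces verbose_eval=...
    ],
)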
Hi! Thank you for the super useful library!
We'd love to use it on time-series tasks, but for that we'd need the internal splitter to respect the temporal dimension of the data (see the sketch below for the behavior we mean).
We're happy to contribute a PR that makes this possible.
Is there anything specific regarding the public API that we should keep in mind for our PR to be accepted?
Thank you!
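To be concrete, sklearn's TimeSeriesSplit shows the splitting behavior we'd want (a reference illustration, not a proposed implementation):

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)
for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(X):
    # every validation fold lies strictly after its training fold
    print(train_idx, test_idx)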
I'd like to get arfs.feature_selection.MinRedundancyMaxRelevance into a cross-validation pipeline that tunes n_features_to_select. In other words, I'd like to do something like this PCA example:
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC
from sklearn.datasets import load_iris

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=0)

pipe = Pipeline([('pca', PCA()), ('svc', SVC())])
param_grid = {
    'pca__n_components': [1, 2, 3],  # iris has only 4 features, so keep n_components <= 4
    'svc__C': [0.1, 1, 10],
    'svc__kernel': ['linear', 'rbf'],
}
grid = GridSearchCV(pipe, param_grid=param_grid)
grid.fit(X_train, y_train)
print("Best parameter (CV score=%0.3f):" % grid.best_score_)
print(grid.best_params_)
The problem is that the MinRedundancyMaxRelevance() class constructor requires n_features_to_select to be specified at initialization; I can't instantiate it with no arguments, like PCA() above, and that means, as far as I can tell, that I can't tune n_features_to_select within a pipeline.
But am I missing something? Is there a different way to tune n_features_to_select within a pipeline?
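One workaround I considered (assuming the selector follows sklearn's get_params/set_params conventions, which is what GridSearchCV relies on): give n_features_to_select a placeholder value at construction and let the grid override it.

from arfs.feature_selection import MinRedundancyMaxRelevance
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

pipe = Pipeline([
    ("mrmr", MinRedundancyMaxRelevance(n_features_to_select=5)),  # placeholder value
    ("svc", SVC()),
])
param_grid = {
    "mrmr__n_features_to_select": [2, 3, 4],  # overridden through set_params by the grid
    "svc__C": [0.1, 1, 10],
}
grid = GridSearchCV(pipe, param_grid=param_grid)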
The Leshy fit method has a try/except block that always overwrites the importance to "shap" if fasttreeshap is not installed. This blocks "pimp" from being selected:
https://github.com/ThomasBury/arfs/blob/main/src/arfs/feature_selection/allrelevant.py#L317
try:
    from fasttreeshap import TreeExplainer as FastTreeExplainer
except ImportError:
    warnings.warn("fasttreeshap is not installed. Fallback to shap.")
    self.importance = "shap"
This should be:
if self.importance == "fastshap":
    try:
        from fasttreeshap import TreeExplainer as FastTreeExplainer
    except ImportError:
        warnings.warn("fasttreeshap is not installed. Fallback to shap.")
        self.importance = "shap"
Thanks for a great library! I'm just starting to look through the code more in depth now. Given that Optuna v3.3.0 supports LightGBM v4.0.0, is it possible to bump those in setup.py? I see that as the reason for the downgrade to 3.3.1 here: https://github.com/ThomasBury/arfs/releases/tag/2.0.6. LightGBM 4.0 adds some nice-to-haves, such as native GPU support. However, FastTreeSHAP doesn't support past version 3.3.5 at the moment, so that would need to default to False; otherwise you'd get the following error: linkedin/FastTreeSHAP#19. Also, out of curiosity, how hard would it be to allow the user to send in their own folds, like Optuna's LGBM implementation (https://optuna.readthedocs.io/en/stable/reference/generated/optuna.integration.lightgbm.LightGBMTunerCV.html)? I work on a variety of problem domains; time series is one of them and the kind of problem I'm currently trying to solve.
The MinRedundancyMaxRelevancy class adds the target column to X.
Steps to reproduce:
1. Instantiate the MinRedundancyMaxRelevancy class.
2. Prepare a DataFrame X and a target series y.
3. Call the MinRedundancyMaxRelevancy fit function with X and y as inputs.
4. The target column is unexpectedly added to the DataFrame X.
Expected behavior: the MinRedundancyMaxRelevancy fit method should compute the necessary values and return them without modifying the input DataFrame X.
Actual behavior: the input DataFrame X is modified by having the target column appended to it after calling the MinRedundancyMaxRelevancy fit method.
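A hypothetical minimal repro of the side effect (my own sketch with random data; the import path and signature follow the mRMR pipeline issue above):

import numpy as np
import pandas as pd
from arfs.feature_selection import MinRedundancyMaxRelevance

X = pd.DataFrame(np.random.rand(50, 3), columns=["f1", "f2", "f3"])
y = pd.Series(np.random.rand(50), name="target")

cols_before = X.columns.tolist()
MinRedundancyMaxRelevance(n_features_to_select=2).fit(X, y)
# Expected to pass; the report above says a 'target' column gets appended instead.
assert X.columns.tolist() == cols_before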
This is minor, but I wanted to note it in case you haven't seen it yourself. I'm getting this warning when I run GrootCV:
The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
https://engineering.linkedin.com/blog/2022/fasttreeshap--accelerating-shap-value-computation-for-trees
It looks like it's a drop-in replacement, which we'll start testing next week!
The predict_contrib output shape of LightGBM has changed, and therefore the computed SHAP feature importance in the "classification + LightGBM" case is wrong.
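For reference, the shape in question (a sketch with random data, assuming recent lightgbm behavior: for multiclass, the per-class SHAP blocks come back flattened):

import lightgbm as lgb
import numpy as np

X = np.random.rand(100, 4)
y = np.random.randint(0, 3, size=100)
clf = lgb.LGBMClassifier(n_estimators=5).fit(X, y)
contrib = clf.predict(X, pred_contrib=True)
print(contrib.shape)  # (100, (4 + 1) * 3): (n_features + 1) entries per class, concatenated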