thomasbury / arfs
All Relevant Feature Selection
License: MIT License
ModuleNotFoundError: No module named 'feature_selection'
arfs/src/arfs/feature_selection/unsupervised.py
Lines 302 to 310 in bbcc785
On L308 the default is weighted_corr, but weighted_corr defaults to 'pearson', not 'spearman' as the documentation claims. I also double-checked it against scipy.stats.spearmanr and the values are not equal.
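A quick sketch of the check I ran (synthetic data, my own snippet rather than the library's):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = x + rng.normal(size=200)

# Pearson and Spearman generally differ; if the default were really
# 'spearman', the selector's values should match spearmanr, not pearsonr.
print(stats.pearsonr(x, y)[0])
print(stats.spearmanr(x, y)[0])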
LightForestRegressor and LightForestClassifier
I can't wait to try this package on my own projects.
Hello, I think I found a possible bug in src/arfs/feature_selection/allrelevant.py, in the function _reduce_vars_sklearn:
# Get mean value from the shadows (max, like in Boruta, median to mitigate variance)
mean_shadow = (
    shadow_vars.select_dtypes(include=[np.number])
    .max(skipna=True)
    .mean(skipna=True)
    / cutoff
)
The bug is about max, which should have axis=1. Without it, it takes the maximum SHAP value across all features for each iteration, but I would expect the contrary: the maximum SHAP value across all iterations for each feature.
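A toy frame makes the difference concrete (hypothetical data; rows are shadow features, columns are per-iteration scores, as in the dataframe built by the function):

import pandas as pd

shadow_vars = pd.DataFrame(
    {"fscore1": [0.1, 0.5], "fscore2": [0.9, 0.2]},
    index=["shadow_a", "shadow_b"],
)
print(shadow_vars.max())        # default axis=0: max across features, one value per iteration
print(shadow_vars.max(axis=1))  # axis=1: max across iterations, one value per feature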
I may be wrong here, but I had to ask. Thank you!
Side note: the comment above the function is probably outdated :)
When using Leshy, it works normally, but it can't render the feature-importance plots.
Hi, thanks for writing such a great module!
After running CollinearityThreshold.fit_transform() on some of my data, I tried to find out which unselected features are collinear with my selected features. Looking at the assoc_matrix_, 502/643 features had no values above my threshold, in contrast to only 231/643 features being selected. Spot-checking some of the unselected features showed that they also never had a value above the threshold. This led me to the code for dropping features, and I am a little confused by it.
arfs/src/arfs/feature_selection/unsupervised.py
Lines 426 to 463 in bbcc785
In lines 438-444, it looks like you are trying to sum the row and column for a given feature and return the feature with the highest average, correct? However, association_matrix[to_drop] == association_matrix.loc[:, to_drop], so in L439 your index would be all features instead of just the features in to_drop.
Second, in L439 and L442 you sort the series, but I believe you should just do a final sort in L444 or L445, because pandas aligns on the index when adding:
>>> series = pd.Series([1, 0], index=['A', 'B'])
>>> series += pd.Series([1.4, 0.1], index=['B', 'A'])
>>> print(series)
A 1.1
B 1.4
dtype: float64
Combined, these two things seem to result in the incorrect feature being dropped.
Related, but not a bug: I found that changing L427-L436 to the following gave a huge speedup (998 ms ± 26.7 ms per loop vs 27.3 s ± 558 ms per loop) in calling _recursive_collinear_elimination for me (n_features=643).
cols_to_drop = association_matrix.loc[
    :, (association_matrix.abs() > threshold).any(axis=0)
].columns.values
rows_to_drop = association_matrix.loc[
    (association_matrix.abs() > threshold).any(axis=1), :
].index.values
Hi, I wonder whether it is necessary to limit the tree depth, as in Boruta, and how much this parameter (and in general all the LGBM settings: leaves, boosting rounds, ...) could impact BoostAGroota. Is it worth covering in the docs?
Thanks, great project.
The CollinearityThreshold class is intended to remove collinear features from datasets. However, it appears to be dropping features that do not meet the specified collinearity threshold, leading to the potential loss of important data. An example of this issue is the unwarranted removal of the 'age' column in the Titanic dataset provided in the examples, where the association values are below the set threshold.
The class should only remove features that are collinear above the specified threshold; features with association values below this threshold should be retained in the dataset.
The class is removing features that do not exceed the collinearity threshold. This behavior is observed in the recursive feature elimination process, where features are dropped inappropriately.
To reproduce, instantiate CollinearityThreshold with a specific threshold. The fix is to modify the _recursive_collinear_elimination method so that it removes only those features that exceed the specified collinearity threshold. The proposed change adds a condition that breaks the while loop when no more features exceed the threshold, preventing the unnecessary removal of features.
Old Version
def _recursive_collinear_elimination(association_matrix, threshold):
    dum = association_matrix.copy()
    most_collinear_features = []

    most_collinear_feature, to_drop = _most_collinear(association_matrix, threshold)
    most_collinear_features.append(most_collinear_feature)
    dum = dum.drop(columns=most_collinear_feature, index=most_collinear_feature)

    while len(to_drop) > 1:
        most_collinear_feature, to_drop = _most_collinear(dum, threshold)
        most_collinear_features.append(most_collinear_feature)
        dum = dum.drop(columns=most_collinear_feature, index=most_collinear_feature)
    return most_collinear_features
New Version
def _recursive_collinear_elimination(association_matrix, threshold):
    dum = association_matrix.copy()
    most_collinear_features = []

    while True:
        most_collinear_feature, to_drop = _most_collinear(dum, threshold)
        # Break if no more features to drop
        if not to_drop:
            break
        if most_collinear_feature not in most_collinear_features:
            most_collinear_features.append(most_collinear_feature)
        dum = dum.drop(columns=most_collinear_feature, index=most_collinear_feature)
    return most_collinear_features
Prevent passing a single categorical column to the nom-nom association (Theil's U and Cramér's V matrices).
Hello, I was analyzing the source code of BoostAGroota and noticed a couple of strange things. I will show the first here and the other in a separate issue, to keep things tidy.
In src/arfs/feature_selection/allrelevant.py, in the function _reduce_vars_sklearn, there is this piece of code:
if i == 1:
    df = pd.DataFrame({"feature": new_x.columns})

df2 = df.copy()
# Store the feature importances in df2
try:
    # Normalize the feature importances
    df2["fscore" + str(i)] = importance / importance.sum()
except ValueError:
    print("Only Sklearn tree based methods allowed")
# Merge the current feature importances with the existing ones in df
df = pd.merge(
    df, df2, on="feature", how="outer", suffixes=("_x" + str(i), "_y" + str(i))
)
The issue is the pd.merge: over n_iterations, it duplicates the feature importances of previous iterations, because the same importance columns are present in both df and df2.
After n_iterations, I expect an importance dataframe of shape (len(real_vars) + len(shadow_vars)) x n_iterations, but the number of columns is far higher because of the pd.merge. As a result, the average importances computed with df["Mean"] = df.select_dtypes(include=[np.number]).mean(axis=1, skipna=True) will be slightly biased by the repeated columns.
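A minimal sketch of the effect (toy frames, not the library's data): merging two frames that share the fscore1 column duplicates it under both suffixes.

import pandas as pd

df = pd.DataFrame({"feature": ["a", "b"], "fscore1": [0.6, 0.4]})
df2 = df.copy()
df2["fscore2"] = [0.7, 0.3]
merged = pd.merge(df, df2, on="feature", how="outer", suffixes=("_x2", "_y2"))
print(merged.columns.tolist())  # ['feature', 'fscore1_x2', 'fscore1_y2', 'fscore2']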
You can check this behavior by comparing df and df2 right after the for loop: df has far too many columns, while df2 has the correct number.
I hope I have described the problem well enough. Let me know what you think.
Hello, I've noticed that if you use set_config(transform_output="pandas"), the BoostAGroota.transform method works incorrectly: it shuffles the columns of the resulting pandas DataFrame (the features left after selection).
Here is a code snippet to reproduce the problem.
import warnings
warnings.filterwarnings('ignore')
from sklearn import set_config
from lightgbm import LGBMRegressor
import arfs.feature_selection.allrelevant as arfsgroot
from arfs.utils import load_data
set_config(transform_output='pandas')
boston = load_data(name="Boston")
X, y = boston.data, boston.target
fs = arfsgroot.BoostAGroota(LGBMRegressor(n_estimators=1, random_state=42))
X_transformed = fs.fit_transform(X, y)
print(X)
print(X_transformed)
As you can see, the column CRIM contains values that were originally in the column AGE.
Requirements used in the code:
arfs==1.0.7
bleach==6.0.0
bokeh==2.4.3
certifi==2022.12.7
charset-normalizer==3.1.0
cloudpickle==2.2.1
colorcet==3.0.1
contourpy==1.0.7
cycler==0.11.0
fonttools==4.39.3
holoviews==1.15.4
idna==3.4
importlib-metadata==6.6.0
importlib-resources==5.12.0
Jinja2==3.1.2
joblib==1.2.0
kiwisolver==1.4.4
lightgbm==3.3.3
llvmlite==0.40.0
Markdown==3.4.3
MarkupSafe==2.1.2
matplotlib==3.7.1
numba==0.57.0
numpy==1.21.6
packaging==23.1
pandas==1.5.1
panel==0.14.4
param==1.13.0
Pillow==9.5.0
pyct==0.5.0
pyparsing==3.0.9
python-dateutil==2.8.2
pytz==2023.3
pyviz-comms==2.2.1
PyYAML==6.0
requests==2.30.0
scikit-learn==1.2.0
scipy==1.8.1
seaborn==0.12.2
shap==0.41.0
six==1.16.0
slicer==0.0.7
threadpoolctl==3.1.0
tornado==6.3.1
tqdm==4.65.0
typing_extensions==4.5.0
tzdata==2023.3
urllib3==2.0.1
webencodings==0.5.1
zipp==3.15.0
I also tried to understand why this happens and figured out that the behavior is caused by the implementation of the transform method.
As you can see below, using return X[self.selected_features_] behaves strangely:
import pandas as pd
import numpy as np
from sklearn import set_config
from sklearn.feature_selection._base import SelectorMixin
from sklearn.base import BaseEstimator

set_config(transform_output="pandas")


class FeatureSelector_with_shuffled_output(SelectorMixin, BaseEstimator):
    def fit(self, X, y):
        self.feature_names_in_ = X.columns.to_numpy()
        random = np.random.RandomState(44)
        self.selected_features_ = random.choice(X.columns, X.shape[1] // 2, replace=False)
        self.support_ = np.array([c in self.selected_features_ for c in X.columns])
        return self

    def _get_support_mask(self):
        return self.support_

    def transform(self, X):
        if not isinstance(X, pd.DataFrame):
            raise ValueError("X needs to be pandas.DataFrame")
        return X[self.selected_features_]


class FeatureSelector(SelectorMixin, BaseEstimator):
    def fit(self, X, y):
        self.feature_names_in_ = X.columns.to_numpy()
        random = np.random.RandomState(44)
        self.selected_features_ = random.choice(X.columns, X.shape[1] // 2, replace=False)
        self.support_ = np.array([c in self.selected_features_ for c in X.columns])
        return self

    def _get_support_mask(self):
        return self.support_


X = pd.DataFrame({
    "a": np.random.randint(50, 100, 10),
    "b": np.random.randint(10, 20, 10),
    "c": np.random.randint(-100, -50, 10),
    "d": np.random.randint(-10, 0, 10),
})
y = pd.Series(np.random.rand(10))

fs = FeatureSelector()
print(fs.fit_transform(X, y))
fsw = FeatureSelector_with_shuffled_output()
print(fsw.fit_transform(X, y))
print(X)
Hope this is helpful! If you have any questions, I'm open to discussion or to adding more information.
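For what it's worth, a possible fix sketch (my assumption, not a confirmed patch): index with the boolean support mask so the values stay in the original column order, which is the order of the feature names that sklearn's pandas output wrapper re-applies.

def transform(self, X):
    if not isinstance(X, pd.DataFrame):
        raise ValueError("X needs to be pandas.DataFrame")
    # Boolean-mask selection preserves the original column order, so the
    # wrapped pandas output gets matching names and values.
    return X.loc[:, self.support_]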
Title: Custom callable/function for the CollinearityThreshold class (nom_nom_assoc | num_num_assoc | nom_num_assoc)
Body:
Description of the Issue:
I encountered an error while trying to implement a custom callable for the CollinearityThreshold class, specifically when integrating the Predictive Power Score (PPS). The code describes the expected callable as follows: "If callable, a function which receives two pd.Series (and optionally a weight array) and returns a single number."
Code Sample:
I've implemented the PPS as follows:
import ppscore as pps

def ppscore_arfs(x, y, **kwargs):
    """
    Calculate the Predictive Power Score (PPS) for series x with respect to series y.

    Parameters:
        x (pandas.Series): A series representing a feature.
        y (pandas.Series): A series representing a feature.
        **kwargs: Additional keyword arguments for the ppscore function.

    Returns:
        float: A score representing the PPS between x and y.
    """
    # Merge x and y into a single DataFrame
    df = pd.concat([x, y], axis=1)
    # Calculate the PPS and extract the score
    score = float(pps.score(df, df.columns[0], df.columns[1])["ppscore"])
    return score
I then applied this function in the CollinearityThreshold class as follows:
selector = CollinearityThreshold(
    method="association",
    nom_nom_assoc=ppscore_arfs,
    num_num_assoc=ppscore_arfs,
    nom_num_assoc=ppscore_arfs,
    threshold=0.85,
).fit(X, y)
Error Encountered:
Upon executing the above, I received the following error message:
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
i:\Coding\00_Playground\arfs\test_arfs.ipynb, cell 20, line 7
      1 selector = CollinearityThreshold(
      2     method="association",
      3     nom_nom_assoc=ppscore_arfs,
      4     num_num_assoc=ppscore_arfs,
      5     nom_num_assoc=ppscore_arfs,
      6     threshold=0.85,
----> 7 ).fit(X, y)

File arfs\feature_selection\unsupervised.py:349, in CollinearityThreshold.fit(self, X, y, sample_weight)
    346 X = encoder.fit_transform(X)
    347 del encoder
--> 349 assoc_matrix = association_matrix(
    350     X=X,
    351     sample_weight=sample_weight,
    352     n_jobs=self.n_jobs,
    353     nom_nom_assoc=self.nom_nom_assoc,
    354     num_num_assoc=self.num_num_assoc,
    355     nom_num_assoc=self.nom_num_assoc,
    356 )
    357 self.assoc_matrix_ = xy_to_matrix(assoc_matrix)
    359 to_drop = _recursive_collinear_elimination(self.assoc_matrix_, self.threshold)

File arfs\association.py:1227, in association_matrix(X, sample_weight, nom_nom_assoc, num_num_assoc, nom_num_assoc, n_jobs, handle_na, nom_nom_comb, num_num_comb, nom_num_comb)
   1225 if n_num_cols >= 2:
   1226     if callable(num_num_assoc):
-> 1227         w_num_num = _callable_association_matrix_fn(
   1228             assoc_fn=num_num_assoc,
   1229             cols_comb=num_num_comb,
   1230             kind="num-num",
   1231             X=X,
   1232             sample_weight=sample_weight,
   1233             n_jobs=n_jobs,
   1234         )
   1235     else:
   1236         w_num_num = wcorr_matrix(
   1237             X, sample_weight, n_jobs, handle_na=None, method=num_num_assoc
   1238         )

File arfs\association.py:1426, in _callable_association_matrix_fn(assoc_fn, X, sample_weight, n_jobs, kind, cols_comb)
   1424 cols_comb = [comb for comb in combinations(selected_cols, 2)]
   1425 _assoc_fn = partial(_compute_matrix_entries, func_xyw=assoc_fn)
-> 1426 assoc = parallel_matrix_entries(
   1427     func=_assoc_fn,
   1428     df=X,
   1429     comb_list=cols_comb,
   1430     sample_weight=sample_weight,
   1431     n_jobs=n_jobs,
   1432 )
   1434 else:
   1435     assoc = None

File arfs\parallel.py:55, in parallel_matrix_entries(func, df, comb_list, sample_weight, n_jobs)
    50 lst = Parallel(n_jobs=n_jobs)(
    51     delayed(func)(X=df, sample_weight=sample_weight, comb_list=comb_chunk)
    52     for comb_chunk in comb_chunks
    53 )
    54 # return flatten list of pandas DF
---> 55 return pd.concat(list(chain(*lst)), ignore_index=True)

File pandas\util\_decorators.py:331, in deprecate_nonkeyword_arguments.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
--> 331 return func(*args, **kwargs)

File pandas\core\reshape\concat.py:368, in concat(objs, axis, join, ignore_index, keys, levels, names, verify_integrity, sort, copy)
--> 368 op = _Concatenator(

File pandas\core\reshape\concat.py:458, in _Concatenator.__init__(self, objs, axis, join, keys, levels, names, ignore_index, verify_integrity, copy, sort)
    453 if not isinstance(obj, (ABCSeries, ABCDataFrame)):
    454     msg = (
    455         f"cannot concatenate object of type '{type(obj)}'; "
    456         "only Series and DataFrame objs are valid"
    457     )
--> 458     raise TypeError(msg)

TypeError: cannot concatenate object of type '<class 'float'>'; only Series and DataFrame objs are valid
Request for Assistance:
I am seeking guidance on how to resolve this error. It seems related to the way the ppscore_arfs function is implemented, or to how it is integrated with the CollinearityThreshold class. Any insights or suggestions on how to correctly implement this custom callable would be greatly appreciated.
Thank you in advance for your assistance!
It looks like it's impossible to pass a model instance to GrootCV, while it's possible for the other classes.
Is that intentional? We'd be happy to contribute a PR.
The charts for Leshy, BoostAGroota and GrootCV are not consistent. Making them render the same information would simplify interpretation.
Thank you for this wonderful package. It must have taken a lot of research and hard work to address so many issues with the older packages! I don't have the expertise to send you a PR, but I did note this:
LightGBM has a class_weight parameter for unbalanced classes that seems to be missing in GrootCV. One can set the objective to multiclass, but there is then no way to pass the corresponding class_weight parameter, so LightGBM warns that the classes are unbalanced.
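For reference, this is the plain LightGBM sklearn-API parameter I mean (a minimal sketch with made-up data, not GrootCV's API):

from lightgbm import LGBMClassifier
import numpy as np

X = np.random.rand(200, 4)
y = np.random.choice([0, 1, 2], size=200, p=[0.8, 0.1, 0.1])  # unbalanced classes
clf = LGBMClassifier(objective="multiclass", class_weight="balanced").fit(X, y)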
Please update the pandas dependency to >= 1.4.0 so that on Apple silicon we can use wheels for Python >= 3.8.
I keep getting the warning below when using the pipeline filtering methods, and it cannot be suppressed using np.seterr() or a warnings filter:
/lib/python3.10/site-packages/arfs/association.py:715: RuntimeWarning: invalid value encountered in scalar divide
  return wcov(x, y, w) / np.sqrt(wcov(x, x, w) * wcov(y, y, w))
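A sketch of why caller-side suppression can fail and where a guard could go (my assumption: the association is computed in joblib worker processes, which do not inherit the parent's np.seterr state; the wcov below is a minimal stand-in, not the library's implementation):

import numpy as np

def wcov(x, y, w):
    # minimal weighted covariance stand-in
    mx, my = np.average(x, weights=w), np.average(y, weights=w)
    return np.average((x - mx) * (y - my), weights=w)

def wcorr(x, y, w):
    # Silencing at the source works even inside worker processes.
    with np.errstate(invalid="ignore", divide="ignore"):
        return wcov(x, y, w) / np.sqrt(wcov(x, x, w) * wcov(y, y, w))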
Hello Thomas,
Thanks for a great library!
I am getting good results with the GrootCV procedure and was wondering whether you have any published work on this algorithm.
I am interested in using it in a study and it would be great to have something to cite :)
Otherwise, how would you like the library to be cited?
Hi Thomas. Thank you for open-sourcing this library, it's very useful.
I found your idea of using Isolation Forest for downsampling the observations passed to SHAP very interesting. I'm wondering whether you also tried HDBSCAN clustering. It is used, for example, as the default downsampling method in the interpret_community library, and it has the advantage of automatically selecting the optimal number of clusters (i.e., the optimal number of samples to best represent the dataset). There is a Python implementation of HDBSCAN in this package: https://hdbscan.readthedocs.io/en/latest/.
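Something like the following is what I have in mind (a rough sketch using the hdbscan package; keeping one representative per cluster is my own simplification):

import numpy as np
import hdbscan

X = np.random.rand(500, 5)
labels = hdbscan.HDBSCAN(min_cluster_size=10).fit(X).labels_
# Keep one representative observation per cluster (noise points are labelled -1).
reps = np.array([X[labels == k].mean(axis=0) for k in set(labels) if k != -1])
print(reps.shape)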
I'd like to use GrootCV not only for feature selection but also for data mining. Specifically, what I'm lacking is a way to see how each feature influenced the prediction for each sample over the iterations. I have been using SHAP for that purpose until now, but the estimates can be a little noisy, or at least I fear so.
Similarly to how I can get the full history of overall importance per feature via GrootCV.cv_df, would it be possible to also surface the impact per feature, per sample, per iteration? That way I could build statistical confidence about each feature's impact on each sample.
When using feat_selector = arfsrel.GrootCV(objective='multiclass', cutoff=1, n_folds=5, n_iter=5, silent=False), I get:
LightGBMError: Number of classes should be specified and greater than 1 for multiclass training
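For context, the error comes from LightGBM itself: multiclass objectives require num_class. A bare LightGBM sketch (my own illustration with random data, not GrootCV's API):

import lightgbm as lgb
import numpy as np

X = np.random.rand(100, 4)
y = np.random.randint(0, 3, size=100)
params = {"objective": "multiclass", "num_class": 3, "verbosity": -1}
booster = lgb.train(params, lgb.Dataset(X, label=y), num_boost_round=5)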
Thank you for this package!
Hey, I was looking through the code of CollinearityThreshold for a similar method I've been putting together, and I noticed that in _recursive_collinear_elimination the user-specified collinearity threshold is not propagated into the while loop. I've highlighted the issue with a comment below:
def _recursive_collinear_elimination(association_matrix, threshold):
    dum = association_matrix.copy()
    most_collinear_features = []

    most_collinear_feature, to_drop = _most_collinear(association_matrix, threshold)
    most_collinear_features.append(most_collinear_feature)
    dum = dum.drop(columns=most_collinear_feature, index=most_collinear_feature)

    while len(to_drop) > 1:
        most_collinear_feature, to_drop = _most_collinear(dum, 0.75)  # Should be `threshold` instead of 0.75
        most_collinear_features.append(most_collinear_feature)
        dum = dum.drop(columns=most_collinear_feature, index=most_collinear_feature)
    return most_collinear_features
Thanks for this fantastic package by the way.
Hello, when I was using Leshy with a CatBoost estimator and a dataset that has categorical features, I noticed that all the features of my dataset were treated as categorical and passed to the cat_features parameter of CatBoost. This is caused by this line:
If you have X = np.array([[1, 2, 'a'], [3, 4, 'b']]) and you pass it to pd.DataFrame, the dtype of every column becomes object. I propose using the original method for creating shadow features, which keeps the original dtypes of the columns:
def _create_shadow(x_train):
    """
    Take all X variables, creating copies and randomly shuffling them

    :param x_train: the dataframe to create shadow features on
    :return: dataframe 2x width and the names of the shadows for removing later
    """
    x_shadow = x_train.copy()
    for c in x_shadow.columns:
        np.random.shuffle(x_shadow[c].values)
    # rename the shadow
    shadow_names = ["ShadowVar" + str(i + 1) for i in range(x_train.shape[1])]
    x_shadow.columns = shadow_names
    # Combine to make one new dataframe
    new_x = pd.concat([x_train, x_shadow], axis=1)
    return new_x, shadow_names
LightGBM changed the way it performs early stopping, logging, etc. by introducing new callbacks. ARFS should be updated to stick to the new lightgbm API.
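A minimal sketch of the callback-based API this refers to (random data; lightgbm >= 3.3):

import lightgbm as lgb
import numpy as np

X, y = np.random.rand(200, 5), np.random.rand(200)
train_set = lgb.Dataset(X[:150], label=y[:150])
valid_set = lgb.Dataset(X[150:], label=y[150:], reference=train_set)

booster = lgb.train(
    {"objective": "regression", "verbosity": -1},
    train_set,
    valid_sets=[valid_set],
    callbacks=[
        lgb.early_stopping(stopping_rounds=10),  # replaces early_stopping_rounds=...
        lgb.log_evaluation(period=0),            # replaces verbose_eval=...
    ],
)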
Hi! Thank you for the super useful library!
We'd love to use it on time-series tasks, but for that we'd need the internal splitter to respect the temporal dimension of the data (see the sketch below for the behavior we mean).
We're happy to contribute a PR that makes this possible.
Is there anything specific regarding the public API that we should keep in mind for our PR to be accepted?
Thank you!
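To be concrete, sklearn's TimeSeriesSplit shows the splitting behavior we'd want (a reference illustration, not a proposed implementation):

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)
for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(X):
    # every validation fold lies strictly after its training fold
    print(train_idx, test_idx)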
I'd like to get arfs.feature_selection.MinRedundancyMaxRelevance into a cross-validation pipeline that tunes n_features_to_select. In other words, I'd like to do something like this PCA example:
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC
from sklearn.datasets import load_iris

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=0)

pipe = Pipeline([('pca', PCA()), ('svc', SVC())])
param_grid = {
    'pca__n_components': [1, 2, 3],  # iris has only 4 features, so keep n_components <= 4
    'svc__C': [0.1, 1, 10],
    'svc__kernel': ['linear', 'rbf'],
}
grid = GridSearchCV(pipe, param_grid=param_grid)
grid.fit(X_train, y_train)
print("Best parameter (CV score=%0.3f):" % grid.best_score_)
print(grid.best_params_)
The problem is that the MinRedundancyMaxRelevance() class constructor requires n_features_to_select to be specified at initialization; I can't instantiate it with no arguments, like PCA() above, and that means, as far as I can tell, that I can't tune n_features_to_select within a pipeline.
But am I missing something? Is there a different way to tune n_features_to_select within a pipeline?
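One workaround I considered (assuming the selector follows sklearn's get_params/set_params conventions, which is what GridSearchCV relies on): give n_features_to_select a placeholder value at construction and let the grid override it.

from arfs.feature_selection import MinRedundancyMaxRelevance
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

pipe = Pipeline([
    ("mrmr", MinRedundancyMaxRelevance(n_features_to_select=5)),  # placeholder value
    ("svc", SVC()),
])
param_grid = {
    "mrmr__n_features_to_select": [2, 3, 4],  # overridden through set_params by the grid
    "svc__C": [0.1, 1, 10],
}
grid = GridSearchCV(pipe, param_grid=param_grid)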
The Leshy fit method has a try/except block that always overwrites the importance to "shap" if fasttreeshap is not installed. This blocks "pimp" from being selected:
https://github.com/ThomasBury/arfs/blob/main/src/arfs/feature_selection/allrelevant.py#L317
try:
    from fasttreeshap import TreeExplainer as FastTreeExplainer
except ImportError:
    warnings.warn("fasttreeshap is not installed. Fallback to shap.")
    self.importance = "shap"
This should be:
if self.importance == "fastshap":
    try:
        from fasttreeshap import TreeExplainer as FastTreeExplainer
    except ImportError:
        warnings.warn("fasttreeshap is not installed. Fallback to shap.")
        self.importance = "shap"
Thanks for a great library! I'm just starting to look through the code more in depth now. Given that Optuna v3.3.0 supports LightGBM v4.0.0, is it possible to bump those in setup.py? I see that as the reason for the downgrade to 3.3.1 here: https://github.com/ThomasBury/arfs/releases/tag/2.0.6. LightGBM 4.0 adds some nice-to-haves, such as native GPU support. However, FastTreeSHAP doesn't support past version 3.3.5 at the moment, so that would need to default to False; otherwise you'd get the following error: linkedin/FastTreeSHAP#19. Also, out of curiosity, how hard would it be to allow the user to send in their own folds, like Optuna's LGBM implementation (https://optuna.readthedocs.io/en/stable/reference/generated/optuna.integration.lightgbm.LightGBMTunerCV.html)? I work on a variety of problem domains; time series is one of them and the kind of problem I'm currently trying to solve.
The MinRedundancyMaxRelevancy class adds the target column to X.
Steps to reproduce:
1. Instantiate the MinRedundancyMaxRelevancy class.
2. Prepare a DataFrame X and a target series y.
3. Call the MinRedundancyMaxRelevancy fit function with X and y as inputs.
4. The target column is unexpectedly added to the DataFrame X.
Expected behavior: the MinRedundancyMaxRelevancy fit method should compute the necessary values and return them without modifying the input DataFrame X.
Actual behavior: the input DataFrame X is modified by having the target column appended to it after calling the MinRedundancyMaxRelevancy fit method.
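A hypothetical minimal repro of the side effect (my own sketch with random data; the import path and signature follow the mRMR pipeline issue above):

import numpy as np
import pandas as pd
from arfs.feature_selection import MinRedundancyMaxRelevance

X = pd.DataFrame(np.random.rand(50, 3), columns=["f1", "f2", "f3"])
y = pd.Series(np.random.rand(50), name="target")

cols_before = X.columns.tolist()
MinRedundancyMaxRelevance(n_features_to_select=2).fit(X, y)
# Expected to pass; the report above says a 'target' column gets appended instead.
assert X.columns.tolist() == cols_before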
This is minor, but I wanted to note it in case you haven't seen it yourself. I'm getting this warning when I run GrootCV:
The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
https://engineering.linkedin.com/blog/2022/fasttreeshap--accelerating-shap-value-computation-for-trees
It looks like it's a drop-in replacement, which we'll start testing next week!
The predict_contrib output shape of LightGBM has changed, and therefore the computed SHAP feature importance in the "classification + LightGBM" case is wrong.
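For reference, the shape in question (a sketch with random data, assuming recent lightgbm behavior: for multiclass, the per-class SHAP blocks come back flattened):

import lightgbm as lgb
import numpy as np

X = np.random.rand(100, 4)
y = np.random.randint(0, 3, size=100)
clf = lgb.LGBMClassifier(n_estimators=5).fit(X, y)
contrib = clf.predict(X, pred_contrib=True)
print(contrib.shape)  # (100, (4 + 1) * 3): (n_features + 1) entries per class, concatenated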