cerlymarco / shap-hypetune
A python package for simultaneous Hyperparameters Tuning and Features Selection for Gradient Boosting Models.
License: MIT License
If parameter tuning is not being used, the evaluation metric is not printed.
Would you consider adding events/callbacks so that custom functions can be called for progress logging / inspecting / interrupting on condition?
Thanks for sharing your work @cerlymarco!
Hi there! First off, kudos for creating such a brilliant package! I came across a minor issue when using hyperopt tuning: an error popped up, and I think it is due to the deprecation of np.random.RandomState.
I tested the above hypothesis locally, and indeed the hyperopt bit works well after updating the hyperopt version and replacing np.random.RandomState. The tested changes can be found in this PR for your kind review: #27
Thanks again for your amazing work!
Hi! Could you please explain why we should use feature_perturbation="tree_path_dependent" in the SHAP importances calculation procedure, rather than "interventional"?
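For context, the two modes differ in how the conditional expectations behind the SHAP values are estimated. A minimal sketch of the two calls, on a tiny synthetic setup so they can be compared directly:

import numpy as np
import shap
from lightgbm import LGBMClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + rng.normal(size=200) > 0).astype(int)
model = LGBMClassifier(n_estimators=50).fit(X, y)

# tree_path_dependent: no background data needed; expectations come from
# the node counts recorded in the trees themselves.
explainer_tpd = shap.TreeExplainer(model, feature_perturbation="tree_path_dependent")

# interventional: needs background data against which features are perturbed;
# slower, but closer to a causal (do-operator) interpretation.
explainer_int = shap.TreeExplainer(model, data=X[:100],
                                   feature_perturbation="interventional")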
Hello,
I have an unbalanced dataset, and I am trying to create a custom scorer that finds the best possible recall above a given precision for the minority class.
The opposite direction seems to work well: when I feed the following scorer to shap-hypetune, it produces consistent results for the precision:
def precision_at_recall(y_true, y_hat):
    precision, recall, thresholds = precision_recall_curve(y_true, y_hat, pos_label=1)
    ix = np.argmax(precision[recall >= .9])
    return 'precision_at_recall', precision[ix], True
The recall and precision for the minority class at a threshold of 0.5 are both around 0.85. If we set a recall above 0.9, the precision decreases accordingly, as expected.
However, the following does not work:
def recall_at_precision(y_true, y_hat):
    precision, recall, thresholds = precision_recall_curve(y_true, y_hat, pos_label=1)
    ix = np.argmax(recall[precision >= .9])
    return 'recall_at_precision', recall[ix], True
It always produces a perfect recall of 1, regardless of the precision threshold, even when the threshold is set to 1.
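One possible explanation (an assumption, worth double-checking): np.argmax(recall[precision >= .9]) returns an index into the masked array, but recall[ix] then indexes the full array. Since precision_recall_curve returns recall in non-increasing order, that argmax is always 0, and recall[0] is always 1.0, which matches the symptom. The precision variant only appears to work because recall >= .9 selects a prefix of the arrays, so the two indexings happen to coincide. A sketch of the fix:

from sklearn.metrics import precision_recall_curve

def recall_at_precision(y_true, y_hat):
    precision, recall, thresholds = precision_recall_curve(y_true, y_hat, pos_label=1)
    # Take the max of the masked array directly, instead of using its argmax
    # to index the full array.
    eligible = recall[precision >= .9]
    best = eligible.max() if eligible.size else 0.0
    return 'recall_at_precision', best, True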
Any chance to extend this amazing tool to Random Forest?
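One possible stopgap: LightGBM ships a random-forest mode that the package can already wrap. A sketch (parameter values are illustrative; rf boosting requires bagging to be enabled):

from lightgbm import LGBMClassifier
from shaphypetune import BoostRFE

# LightGBM's random-forest mode: no boosting, just bagged trees.
rf_like = LGBMClassifier(
    boosting_type='rf',
    bagging_freq=1,        # perform bagging at every iteration
    bagging_fraction=0.8,  # subsample rows, as a random forest would
    random_state=0,
)
model = BoostRFE(rf_like, min_features_to_select=5, step=1)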
I've set n_jobs to a number greater than 1; sometimes it runs in parallel, but sometimes it doesn't.
Hi,
Thank you so much for the great package. It is really easy to use.
I am developing a project and need to apply nested cross-validation with BoostRFE, and I want to get the score of each feature set and each hyperparameter combination.
To make it a bit clearer:
In each iteration, the algorithm goes through feature selection and hyperparameter optimisation.
Using model.trials_ I can get the parameters, and using model.scores_ I can get the score of each parameter set.
However, I cannot find the results of the iterations for each feature set.
How can I get trials for both the feature sets and the hyperparameters, together with the corresponding performance for each feature-set/hyperparameter pair?
Should model.trials_ return the feature set and hyperparameters together?
Hi,
This package is awesome. Any plans to implement k-fold or blocked time-series cross-validation?
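For now I am doing it manually outside the library. A sketch of a blocked time-series loop on synthetic data, assuming one fit per split with that split's validation fold passed as eval_set, and that the fitted selector exposes predict (as the repo's examples suggest):

import numpy as np
from lightgbm import LGBMRegressor
from shaphypetune import BoostRFE
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))        # time-ordered rows
y = X[:, 0] + rng.normal(size=500)

tscv = TimeSeriesSplit(n_splits=5)
scores = []
for train_idx, valid_idx in tscv.split(X):
    model = BoostRFE(LGBMRegressor(random_state=0),
                     min_features_to_select=5, step=1)
    model.fit(X[train_idx], y[train_idx],
              eval_set=[(X[valid_idx], y[valid_idx])],
              early_stopping_rounds=20, verbose=0)
    scores.append(mean_squared_error(y[valid_idx], model.predict(X[valid_idx])))
print(np.mean(scores))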
Thank You
Is it possible to use SHAP with an sklearn pipeline (in my use case an imblearn pipeline) where the final estimator is XGBoost?
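A sketch of one way to do it, assuming a fitted imblearn/sklearn Pipeline named pipe whose last step is the XGBoost model (pipe and X are placeholders): explain the boosted model on the transformed feature matrix, since TreeExplainer does not understand pipelines.

import shap

X_trans = pipe[:-1].transform(X)          # apply all preprocessing steps
explainer = shap.TreeExplainer(pipe[-1])  # the final XGBoost estimator
shap_values = explainer.shap_values(X_trans)
# Note: imblearn samplers only act at fit time, so they are skipped by
# transform, which is what you want when explaining predictions.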
Hi,
I am still running a series of experiments with shap-hypetune: a sort of cross-validation over a number of stratified K-fold splits.
For each split, I generate random seeds like this: np.random.randint(4294967295).
A typical run goes like this (there is one for each split):
11 trials detected for ('num_leaves', 'n_estimators', 'max_depth', 'learning_rate')
trial: 0001 ### iterations: 00008 ### eval_score: 0.94737
trial: 0002 ### iterations: 00018 ### eval_score: 0.92481
trial: 0003 ### iterations: 00020 ### eval_score: 0.99248
trial: 0004 ### iterations: 00017 ### eval_score: 0.97744
trial: 0005 ### iterations: 00025 ### eval_score: 0.98496
trial: 0006 ### iterations: 00012 ### eval_score: 0.97744
trial: 0007 ### iterations: 00020 ### eval_score: 0.99248
trial: 0008 ### iterations: 00012 ### eval_score: 0.98496
trial: 0009 ### iterations: 00021 ### eval_score: 0.98496
trial: 0010 ### iterations: 00018 ### eval_score: 0.98496
trial: 0011 ### iterations: 00025 ### eval_score: 0.98496
11 trials detected for ('num_leaves', 'n_estimators', 'max_depth', 'learning_rate')
trial: 0001 ### iterations: 00025 ### eval_score: 0.96241
trial: 0002 ### iterations: 00038 ### eval_score: 0.97744
trial: 0003 ### iterations: 00037 ### eval_score: 0.97744
trial: 0004 ### iterations: 00015 ### eval_score: 0.96241
trial: 0005 ### iterations: 00002 ### eval_score: 0.81203
trial: 0006 ### iterations: 00018 ### eval_score: 0.96241
trial: 0007 ### iterations: 00016 ### eval_score: 0.96241
trial: 0008 ### iterations: 00011 ### eval_score: 0.91729
trial: 0009 ### iterations: 00038 ### eval_score: 0.97744
trial: 0010 ### iterations: 00022 ### eval_score: 0.96241
trial: 0011 ### iterations: 00021 ### eval_score: 0.96992
However, sometimes the eval_score drops dramatically, and this does not seem to be typical stochastic behaviour: normally, if it drops for one split, it drops for all the subsequent splits as well, in spite of the fact that a new seed is (pseudo-)randomly generated for each split at each stage:
skf = StratifiedKFold(n_splits=5, shuffle=True,
                      random_state=np.random.randint(4294967295))
clf_lgbm = LGBMClassifier(boosting_type='rf',
                          random_state=np.random.randint(4294967295),
                          ...)
model = BoostRFA(
    sampling_seed=np.random.randint(4294967295),
    ...)
In other cases the number of iterations stays constant for each run:
11 trials detected for ('num_leaves', 'n_estimators', 'max_depth', 'learning_rate')
trial: 0001 ### iterations: 00001 ### eval_score: 0.69173
trial: 0002 ### iterations: 00001 ### eval_score: 0.7594
trial: 0003 ### iterations: 00001 ### eval_score: 0.69173
trial: 0004 ### iterations: 00001 ### eval_score: 0.69173
trial: 0005 ### iterations: 00001 ### eval_score: 0.79699
trial: 0006 ### iterations: 00001 ### eval_score: 0.69173
trial: 0007 ### iterations: 00001 ### eval_score: 0.69173
trial: 0008 ### iterations: 00001 ### eval_score: 0.7594
trial: 0009 ### iterations: 00001 ### eval_score: 0.69173
trial: 0010 ### iterations: 00001 ### eval_score: 0.69173
trial: 0011 ### iterations: 00001 ### eval_score: 0.69173
11 trials detected for ('num_leaves', 'n_estimators', 'max_depth', 'learning_rate')
trial: 0001 ### iterations: 00001 ### eval_score: 0.82707
trial: 0002 ### iterations: 00001 ### eval_score: 0.82707
trial: 0003 ### iterations: 00001 ### eval_score: 0.82707
trial: 0004 ### iterations: 00001 ### eval_score: 0.82707
trial: 0005 ### iterations: 00001 ### eval_score: 0.81955
trial: 0006 ### iterations: 00001 ### eval_score: 0.82707
trial: 0007 ### iterations: 00001 ### eval_score: 0.81955
trial: 0008 ### iterations: 00001 ### eval_score: 0.81955
trial: 0009 ### iterations: 00001 ### eval_score: 0.82707
trial: 0010 ### iterations: 00001 ### eval_score: 0.82707
trial: 0011 ### iterations: 00001 ### eval_score: 0.82707
If you re-run the script, you typically observe the normal behaviour again.
Hi,
While running BoostBoruta according to the notebook tutorial, I'm getting the following warnings, which I would like to suppress:
'early_stopping_rounds' argument is deprecated and will be removed in a future release of LightGBM. Pass 'early_stopping()' callback via 'callbacks' argument instead.
'verbose' argument is deprecated and will be removed in a future release of LightGBM. Pass 'log_evaluation()' callback via 'callbacks' argument instead.
Any ideas on how to do that?
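So far the closest I've got is filtering them with Python's warnings module; a sketch (the message patterns are copied from the warnings above):

import warnings

warnings.filterwarnings(
    "ignore", message=".*'early_stopping_rounds' argument is deprecated.*")
warnings.filterwarnings(
    "ignore", message=".*'verbose' argument is deprecated.*")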
Thank you
Hi! :) We're looking forward to using this library; it looks like just what we're after in terms of feature selection!
We have our own little, lightweight time-series pipelining framework that automatically wraps scikit-learn classes, and we found that BoostShap is not a subclass of SelectorMixin, which is how we identify feature selectors (they have a get_support() method). Would it be possible to change the base class to SelectorMixin, to conform to the sklearn conventions?
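In the meantime we are bridging the gap with a thin adapter; a sketch, assuming the fitted selector exposes a boolean support_ mask after fit (as the RFE-style classes appear to):

import numpy as np
from sklearn.base import BaseEstimator
from sklearn.feature_selection import SelectorMixin

class SelectorAdapter(SelectorMixin, BaseEstimator):
    """Make a shap-hypetune selector quack like a scikit-learn SelectorMixin."""

    def __init__(self, selector):
        self.selector = selector

    def fit(self, X, y=None, **fit_params):
        self.selector.fit(X, y, **fit_params)
        self.n_features_in_ = X.shape[1]
        return self

    def _get_support_mask(self):
        # Assumption: the wrapped selector exposes `support_` after fitting.
        return np.asarray(self.selector.support_, dtype=bool)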
Thank you!
Hi,
Thank you for your effort for the great package.
I have some questions about your work; sorry if they are simple ones, as I do not have much experience with XGBoost.
Is there any plan to support catboost estimator alongside xgboost/lightgbm?
Hi, when I try to use distributions like quniform for finding the best num_leaves or max_depth, e.g.
'num_leaves': hp.quniform('num_leaves', 2, 25, 1),
I receive an error message like "Parameter num_leaves should be of type int, got '24.0'".
I saw these distributions used for optimizing discrete parameters with just hyperopt. How could I use them with shap-hypetune?
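The only workaround I have found so far is hyperopt's scope.int wrapper; a sketch (the max_depth range is illustrative):

from hyperopt import hp
from hyperopt.pyll import scope

# quniform samples floats (e.g. 24.0); scope.int casts the sampled value
# to int before it reaches LightGBM/XGBoost.
param_grid = {
    'num_leaves': scope.int(hp.quniform('num_leaves', 2, 25, 1)),
    'max_depth': scope.int(hp.quniform('max_depth', 2, 10, 1)),
}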
For example, I have a BoostRFE object and have called its model.fit() method. How can I then get the exact list of SHAP importances corresponding to each feature selected by the model's estimator (just like what shap.plots.bar shows)? The values output by shap.Explainer do not seem that straightforward to me; I have not found any info about the shape or meaning of those values.
Or can I get the SHAP values directly from the BoostRFE object?
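What I am currently trying is the following sketch, assuming the fitted BoostRFE exposes the refitted booster as estimator_ and reduces a matrix to the selected columns via transform (model and X are placeholders):

import numpy as np
import shap

X_sel = model.transform(X)                       # only the selected features
explainer = shap.TreeExplainer(model.estimator_)
shap_values = explainer.shap_values(X_sel)       # may be a list (one array per class)
if isinstance(shap_values, list):
    shap_values = shap_values[1]                 # positive class, for binary tasks
importances = np.abs(shap_values).mean(axis=0)   # the quantity shap.plots.bar displays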
Hi, I am getting an error while running BoostBoruta for a binary classification task.
The size of the data is:
print(X_clf_train.shape, y_clf_train.shape)
print(X_clf_valid.shape, y_clf_valid.shape)
(102, 32) (102,)
(12, 32) (12,)
and here is the code I use:
### BORUTA ###
model = BoostBoruta(
    clf_xgb,
    max_iter=200,
    perc=100,
    sampling_seed=0,
    verbose=3,
    n_jobs=-1,
)
model.fit(X_clf_train,
          y_clf_train,
          eval_set=[(X_clf_valid, y_clf_valid)],
          early_stopping_rounds=6,
          verbose=3)
print(model.n_features_)
and the error:
XGBoostError Traceback (most recent call last)
/tmp/ipykernel_4016678/3155018104.py in <cell line: 11>()
9 n_jobs=-1,
10 )
---> 11 model.fit(X_clf_train,
12 y_clf_train,
13 eval_set=[(X_clf_valid, y_clf_valid)],
~/myvenv/mykears3.9/lib/python3.9/site-packages/shaphypetune/_classes.py in fit(self, X, y, trials, **fit_params)
163
164 if self.param_grid is None:
--> 165 results = self._fit(X, y, fit_params)
166
167 for v in vars(results['model']):
~/myvenv/mykears3.9/lib/python3.9/site-packages/shaphypetune/_classes.py in _fit(self, X, y, fit_params, params)
66 model = self._build_model(params)
67 if isinstance(model, _BoostSelector):
---> 68 model.fit(X=X, y=y, **fit_params)
69 else:
70 with contextlib.redirect_stdout(io.StringIO()):
~/myvenv/mykears3.9/lib/python3.9/site-packages/shaphypetune/_classes.py in fit(self, X, y, **fit_params)
521 _X = self._create_X(X, feat_id_real)
522 with contextlib.redirect_stdout(io.StringIO()):
--> 523 estimator.fit(_X, y, **_fit_params)
524
525 # get coefs
~/myvenv/mykears3.9/lib/python3.9/site-packages/xgboost/core.py in inner_f(*args, **kwargs)
434 for k, arg in zip(sig.parameters, args):
435 kwargs[k] = arg
--> 436 return f(**kwargs)
437
438 return inner_f
~/myvenv/mykears3.9/lib/python3.9/site-packages/xgboost/sklearn.py in fit(self, X, y, sample_weight, base_margin, eval_set, eval_metric, early_stopping_rounds, verbose, xgb_model, sample_weight_eval_set, base_margin_eval_set, feature_weights, callbacks)
1174 )
1175
-> 1176 self._Booster = train(
1177 params,
1178 train_dmatrix,
~/myvenv/mykears3.9/lib/python3.9/site-packages/xgboost/training.py in train(params, dtrain, num_boost_round, evals, obj, feval, maximize, early_stopping_rounds, evals_result, verbose_eval, xgb_model, callbacks)
187 Booster : a trained booster model
188 """
--> 189 bst = _train_internal(params, dtrain,
190 num_boost_round=num_boost_round,
191 evals=evals,
~/myvenv/mykears3.9/lib/python3.9/site-packages/xgboost/training.py in _train_internal(params, dtrain, num_boost_round, evals, obj, feval, xgb_model, callbacks, evals_result, maximize, verbose_eval, early_stopping_rounds)
79 if callbacks.before_iteration(bst, i, dtrain, evals):
80 break
---> 81 bst.update(dtrain, i, obj)
82 if callbacks.after_iteration(bst, i, dtrain, evals):
83 break
~/myvenv/mykears3.9/lib/python3.9/site-packages/xgboost/core.py in update(self, dtrain, iteration, fobj)
1497
1498 if fobj is None:
-> 1499 _check_call(_LIB.XGBoosterUpdateOneIter(self.handle,
1500 ctypes.c_int(iteration),
1501 dtrain.handle))
~/myvenv/mykears3.9/lib/python3.9/site-packages/xgboost/core.py in _check_call(ret)
208 """
209 if ret != 0:
--> 210 raise XGBoostError(py_str(_LIB.XGBGetLastError()))
211
212
XGBoostError: [12:37:22] ../src/data/data.cc:583: Check failed: labels_.Size() == num_row_ (102 vs. 160) : Size of labels must equal to number of rows.
Stack trace:
[bt] (0) /home/zeydabadi/myvenv/mykears3.9/lib/python3.9/site-packages/xgboost/lib/libxgboost.so(+0x9133f) [0x7fe8df99b33f]
[bt] (1) /home/zeydabadi/myvenv/mykears3.9/lib/python3.9/site-packages/xgboost/lib/libxgboost.so(+0x110fcc) [0x7fe8dfa1afcc]
[bt] (2) /home/zeydabadi/myvenv/mykears3.9/lib/python3.9/site-packages/xgboost/lib/libxgboost.so(+0x1b90e7) [0x7fe8dfac30e7]
[bt] (3) /home/zeydabadi/myvenv/mykears3.9/lib/python3.9/site-packages/xgboost/lib/libxgboost.so(+0x1b99bc) [0x7fe8dfac39bc]
[bt] (4) /home/zeydabadi/myvenv/mykears3.9/lib/python3.9/site-packages/xgboost/lib/libxgboost.so(XGBoosterUpdateOneIter+0x50) [0x7fe8df98aed0]
[bt] (5) /lib64/libffi.so.6(ffi_call_unix64+0x4c) [0x7febb4ca610e]
[bt] (6) /lib64/libffi.so.6(ffi_call+0x36f) [0x7febb4ca5abf]
[bt] (7) /home/zeydabadi/mykeras/bin/usr/local/lib/python3.9/lib-dynload/_ctypes.cpython-39-x86_64-linux-gnu.so(+0x11235) [0x7febb4eba235]
[bt] (8) /home/zeydabadi/mykeras/bin/usr/local/lib/python3.9/lib-dynload/_ctypes.cpython-39-x86_64-linux-gnu.so(+0xaa66) [0x7febb4eb3a66]
In addition to random and grid search, is it possible to add Bayesian search as well? The tune-sklearn package supports multiple search algorithms using the scikit-learn API.
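Worth noting: if I read the docs right, shap-hypetune already supports hyperopt's Bayesian (TPE) search when param_grid contains hyperopt distributions, n_iter is set, and a Trials object is passed to fit. A sketch (the ranges are illustrative, and X_train/X_valid are placeholders):

from hyperopt import hp, Trials
from lightgbm import LGBMClassifier
from shaphypetune import BoostSearch

param_dist = {
    'learning_rate': hp.loguniform('learning_rate', -5, 0),
    'num_leaves': hp.choice('num_leaves', [15, 31, 63]),
}
model = BoostSearch(LGBMClassifier(), param_grid=param_dist,
                    n_iter=25, sampling_seed=0)
model.fit(X_train, y_train, trials=Trials(),
          eval_set=[(X_valid, y_valid)],
          early_stopping_rounds=20, verbose=0)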
Hello,
Great package! Very easy to use, and very effective! :)
I was wondering if it is possible to use custom CVs for random search plus feature selection.
Thanks!
Hi, I apologize if this is a dumb question, but I can't find where to get the list of important features from the trained model. Thanks for any pointers.
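A sketch of where I would look, assuming a fitted selector (e.g. BoostBoruta or BoostRFE) named model, a pandas DataFrame X, and that the selector exposes a boolean support_ mask (my reading of the docs):

print(model.n_features_)              # how many features survived selection
selected = X.columns[model.support_]  # map the mask back to column names
print(list(selected))
X_reduced = model.transform(X)        # X restricted to the selected features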
This is an excellent repo for doing hyper-parameter tuning, with an approach that uses SHAP measurements. A publication or preprint would help practitioners understand the repo more deeply. So I'm curious: do you have any plans to write one?
It would be nice to have options to select some state-of-the-art techniques for hyperparameter optimization.
Such as:
https://scikit-optimize.github.io/stable/
https://github.com/optuna/optuna
or maybe the best option (it should be a drop-in replacement for scikit-learn Grid/Random search, while supporting the advanced techniques from the packages above):
https://github.com/ray-project/tune-sklearn
Thank you for this great tool! I get this error while using BoostRFA. I believe it's coming from the rstate parameter of the fmin function in hyperopt. In _classes.py, I see that rstate is assigned np.random.RandomState(self.sampling_seed), which seems to be deprecated. In the link below, they recommend updating the rstate parameter to np.random.default_rng(SEED).
Please correct me if I'm wrong here. Thank you!
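For reference, a sketch of the one-line change being suggested (SEED is illustrative):

import numpy as np

SEED = 42

# old style, deprecated by newer hyperopt releases:
rstate_old = np.random.RandomState(SEED)
# suggested replacement, accepted by fmin(rstate=...) in recent hyperopt:
rstate_new = np.random.default_rng(SEED)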
Hi,
If I use a custom metric like the Brier score, where lower is better, does this package support minimizing the eval metric, or does it try to maximize by default?
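For context, with LightGBM-style custom eval metrics the third element of the returned tuple declares the direction (True means higher is better), so a lower-is-better metric like the Brier score would return False. A sketch:

from sklearn.metrics import brier_score_loss

def brier_eval(y_true, y_hat):
    # (metric name, value, is_higher_better): False asks for minimization.
    return 'brier', brier_score_loss(y_true, y_hat), False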
Thank You
Is there any way to fix the seed so that, given the same input, BoostBoruta always finds the same best k features?
I have a dataset with >2000 features. After I run BoostBoruta, 10 features show up in the results, but they carry no label/index information.
How can I retrieve the feature importances and map them back to the indices/labels of the original feature set?
Hello,
First of all, thank you for this great repo. It looks very promising.
I'd like to use BoostBoruta within a scikit-learn pipeline. Is it possible?
For now, here is the code I've tried, with no success:
# get the categorical and numeric column names
num_cols = X_train.select_dtypes(exclude=['object']).columns.tolist()
cat_cols = X_train.select_dtypes(include=['object']).columns.tolist()

# pipeline for numerical columns
num_pipe = make_pipeline(
    StandardScaler()
)

# pipeline for categorical columns
cat_pipe = make_pipeline(
    OneHotEncoder(handle_unknown='ignore', sparse=False)
)

# combine both the pipelines
full_pipe = ColumnTransformer([
    ('num', num_pipe, num_cols),
    ('cat', cat_pipe, cat_cols)
])

model = BoostBoruta(
    clf_lgbm, param_grid=param_dist_hyperopt, n_iter=8, sampling_seed=0,
    importance_type="shap", train_importance=True, n_jobs=-1, verbose=2
)

pipeline_hypetune = make_pipeline(full_pipe, model)

model_selection = RepeatedStratifiedKFold(n_splits=10, n_repeats=2, random_state=2022)
results = cross_validate(pipeline_hypetune, X_train, y, scoring='accuracy',
                         cv=model_selection, return_estimator=True)
No exception is thrown but no model is learned either... Any ideas why?
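One possible explanation (my assumption): cross_validate clones the estimator for every split, so pipeline_hypetune itself stays unfitted; with return_estimator=True, the fitted pipelines are in results['estimator']. A sketch of where to look:

fitted_pipes = results['estimator']   # one fitted Pipeline per split
boruta_step = fitted_pipes[0][-1]     # the BoostBoruta step of the first split
print(boruta_step.n_features_)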
Thanks in advance
Hi @cerlymarco ,
[screenshot: https://user-images.githubusercontent.com/42869040/162376574-03869b81-f11e-4d1f-8bea-eddb714d39b0.png]
Thanks
Originally posted by @VinayChaudhari1996 in #4 (comment)
For my dataset I'm getting this error:
ExplainerError: Additivity check failed in TreeExplainer! Please ensure the data matrix you passed to the explainer is the same shape that the model was trained on. If your data shape is correct then please report this on GitHub. Consider retrying with the feature_perturbation='interventional' option. This check failed because for one of the samples the sum of the SHAP values was -0.577556, while the model output was -0.540311. If this difference is acceptable you can set check_additivity=False to disable this check.
I'm using it like this:
model = BoostRFE(regr_xgb, param_grid=param_dist,
                 min_features_to_select=10,
                 step=20,
                 importance_type='shap_importances',
                 n_iter=5)
Any suggestion how to solve this?
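If the additivity failure blocks the run entirely, one workaround to try (a sketch; it swaps SHAP-based rankings for the booster's native gain-based importances, which shap-hypetune also appears to accept) is switching the importance source:

model = BoostRFE(regr_xgb, param_grid=param_dist,
                 min_features_to_select=10,
                 step=20,
                 importance_type='feature_importances',  # native importances instead of SHAP
                 n_iter=5)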