Comments (8)
Hi,
BorutaPy now works with all tree-based methods in scikit-learn, so it'll work with GradientBoostingClassifier, which is pretty close to XGBoost. XGBoost is super fast though, so adding it would be amazing.
I think there's a scikit-learn-like interface for XGBoost that has the fit method, which is great. The only other thing you need to use BorutaPy with XGBoost is a way to extract variable importances from the model; in BorutaPy this is done through scikit-learn's feature_importances_ property. Unfortunately, the authors of the scikit-learn-like interface of XGBoost did not comply with this and instead use a get_score() method to return the variable importances.
So I guess the easiest thing to do would be to take the scikit-learn-like interface of XGBoost and extend the class with a feature_importances_ property (which would just call get_score() under the hood).
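A minimal sketch of what such a wrapper might look like. The class and its details here are illustrative, not BorutaPy's or XGBoost's actual API (and newer xgboost releases may already expose feature_importances_ themselves); the key point is just translating a get_score()-style dict into the dense array scikit-learn tools expect:

```python
import numpy as np

class BorutaCompatibleXGB:
    """Hypothetical wrapper: adapts an estimator with a get_score()-style
    importance API to the feature_importances_ property BorutaPy expects."""

    def __init__(self, model, n_features):
        self.model = model            # estimator exposing fit() and get_score()
        self.n_features = n_features  # total number of features in the data

    def fit(self, X, y):
        self.model.fit(X, y)
        return self

    def get_params(self, deep=True):
        return self.model.get_params(deep=deep)

    def set_params(self, **params):
        self.model.set_params(**params)
        return self

    @property
    def feature_importances_(self):
        # get_score() returns a dict like {'f0': 12.0, 'f3': 4.0}; features
        # never used in a split are absent from it, so expand the dict into
        # a dense, normalised array of length n_features.
        scores = self.model.get_score()
        imp = np.zeros(self.n_features)
        for name, value in scores.items():
            imp[int(name.lstrip('f'))] = value
        total = imp.sum()
        return imp / total if total > 0 else imp
```

BorutaPy could then be handed a `BorutaCompatibleXGB` instance wherever it expects a scikit-learn estimator, since all it reads back is `feature_importances_`.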
I don't want to add this to BorutaPy itself because it's supposed to support scikit-learn and not other packages. Nonetheless, if you find a way to do this, let me know and maybe I'll include it in the README as an example.
from boruta_py.
Hello Daniel,
Thanks for your response. I looked into this a little more, and it appears (as you mentioned) that the issue lies with XGBoost's implementation of the sklearn API. I made some alterations and got it working with Boruta. I am working with the XGBoost maintainers to see if they'll accept a PR for the updates to the API.
I will post back here if/when the PR gets through.
Great package by the way!
Sweet, thanks!
@gaw89 Out of curiosity, does this even work? I would expect XGBoost scores to just degenerate Boruta into a minimal optimal method...
Hm, interesting point Miron. Is that because of this?
@mbq if you are referring to the link that Daniel posted, it seems you may be right, at least in the case of correlated features. However, I wonder: is the only difference between minimal-optimal and all-relevant feature selection that all-relevant retains correlated features?
Also, since Boruta uses multiple runs of the wrapper algorithm, it seems pretty unlikely that, given correlated features A and B, it would choose A every time and never choose B (particularly when using a large number of runs of the algorithm). I suppose this also assumes a different random seed for each run of the underlying algorithm, but I am not sure about that.
I guess the real question here is: is there any utility in using something other than Random Forest for the Boruta wrapper? Will I get a different/better set of features using XGBoost, a decision tree, etc. than using Random Forest? My thought was that, since I am using XGBoost for my final model, it would be better to use the same algorithm with the same parameters for my feature selection, but maybe that isn't the case. Do either of you have any thoughts on this?
Cool algorithm/package by the way!
Thanks!
This correlated-features issue is a bit more complicated; imagine a data set with features A, B and C such that both A and f(B,C) explain Y perfectly, and f is non-trivial in the sense that neither B nor C on its own is a good predictor. Now, no greedy CART-based method will ever touch B or C, regardless of the random seed; however, after removing A, it will happily build a 100% accurate model on them.
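That scenario is easy to simulate: take Y = XOR(B, C) with A duplicating Y, so A alone is a perfect predictor while B and C carry no individual signal. A quick numpy-only sketch (the helper below stands in for a single greedy CART split; it is an illustration, not BorutaPy code):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# B and C are independent coin flips; Y = XOR(B, C); A duplicates Y.
B = rng.integers(0, 2, n)
C = rng.integers(0, 2, n)
Y = B ^ C
A = Y.copy()

def best_single_split_accuracy(x, y):
    """Accuracy of a one-feature greedy split on a binary x:
    predict the majority class on each side of the split."""
    left, right = y[x == 0], y[x == 1]
    correct = max(left.sum(), len(left) - left.sum()) + \
              max(right.sum(), len(right) - right.sum())
    return correct / len(y)

acc_A = best_single_split_accuracy(A, Y)  # 1.0: A alone is perfect
acc_B = best_single_split_accuracy(B, Y)  # ~0.5: B alone looks like noise
acc_C = best_single_split_accuracy(C, Y)  # ~0.5: C alone looks like noise
print(acc_A, acc_B, acc_C)
```

A greedy tree therefore always splits on A first and never rewards B or C with importance; drop A, and a two-level tree on B and C alone reconstructs Y exactly.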
One may ask who needs B or C when A does the job? The standard answer is that it allows for a better understanding of the problem, which is useless if you only need a good model. The less standard one is that it leads to more robust models (A might be clean in train but may get noisier in test); also, in p>>n settings, spurious random correlations may easily become indistinguishable from real ones, so AR selection is better for designing further studies / doing meta-analyses (A may be a lucky piece of nonsense). Formally, it is even crazier, because there is no minimal-optimal selection when there is a perfect duplicate of information, but I'll leave that for now.
Back to your main question. Boruta is mostly for drawing a line between weakly relevant features and noise; it basically assumes that the importance source somehow homogeneously scans the feature space. Greedy methods like canonical boosting go in the opposite direction, hence my concern that they are not the best importance sources; but I haven't tested that, especially the stochastic modifications, so I don't know for sure.
About the final model: theoretically, the AR set is not fundamental, so it won't depend on the method that produced it, provided that method works well, so using the same modelling in both places shouldn't be beneficial. What's more, again theoretically, AR is totally redundant for model optimisation, while being way more expensive than MO. The only caveat is the aforementioned robustness, but that should be relevant only in pathological cases anyway.