Comments (8)

danielhomola avatar danielhomola commented on July 25, 2024

Hi,

BorutaPy now works with all tree-based methods in scikit-learn, so it'll work with GradientBoostingClassifier, which is pretty close to XGBoost. XGBoost is super fast though, so adding it would be amazing.

I think there's a scikit-learn-like interface for XGBoost that has the fit method, which is great. The only other thing you need to use BorutaPy with XGBoost is a method to extract variable importance from the model. In BorutaPy this is done with scikit-learn's feature_importances_ property. Unfortunately the creators of the scikit-learn-like interface of XGBoost did not comply with this; they use a get_score() method to return the variable importances.

So I guess the easiest thing to do would be to take the scikit-learn-like interface of XGBoost and extend the class with a feature_importances_ property (which would just call get_score() under the hood).
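A minimal sketch of that adapter idea (the class and attribute names here are hypothetical, and the `{'f0': score, ...}` dict format returned by XGBoost's `get_score()` should be checked against the xgboost version you use):

```python
import numpy as np

class FeatureImportancesMixin:
    """Hypothetical adapter: expose a scikit-learn-style
    `feature_importances_` property on top of an XGBoost-style
    `get_score()` method returning a dict like {'f0': 6.0, 'f2': 2.0}."""

    n_features_ = None  # assumed to be set by the wrapped estimator after fit()

    @property
    def feature_importances_(self):
        scores = self.get_score()
        # Features never used in a split are absent from the dict -> importance 0.
        imp = np.zeros(self.n_features_)
        for name, score in scores.items():
            imp[int(name[1:])] = score  # 'f3' -> index 3
        total = imp.sum()
        # Normalise so importances sum to 1, like scikit-learn's trees do.
        return imp / total if total > 0 else imp

# Stub standing in for a fitted XGBoost sklearn-wrapper, just to show the shape:
class DummyModel(FeatureImportancesMixin):
    n_features_ = 4
    def get_score(self):
        return {'f0': 6.0, 'f2': 2.0}

fi = DummyModel().feature_importances_  # array summing to 1, zeros for unused features
```

In real use you would subclass the XGBoost sklearn wrapper and mix this property in, so BorutaPy's importance extraction works unchanged.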

I don't want to add this to BorutaPy because it's supposed to support scikit-learn and not other packages. Nonetheless, if you find a way to do this, let me know and I'll include it in the README as an example maybe.

from boruta_py.

gaw89 avatar gaw89 commented on July 25, 2024

Hello Daniel,

Thanks for your response. I looked into this a little more, and it appears (as you mentioned) that the issue lies with Xgboost's implementation of the Sklearn API. I made some alterations and got it working with Boruta. I am working with the Xgboost maintainers to see if they'll accept a PR for the updates to the API.

I will post back here if/when the PR gets through.

Great package by the way!

danielhomola avatar danielhomola commented on July 25, 2024

Sweet, thanks!

mbq avatar mbq commented on July 25, 2024

@gaw89 Out of curiosity, does this even work? I would expect XGBoost scores to just degenerate Boruta into a minimal optimal method...

danielhomola avatar danielhomola commented on July 25, 2024

Hm, interesting point Miron... Is that because of this?

gaw89 avatar gaw89 commented on July 25, 2024

@mbq if you are referring to the link that Daniel posted, it seems you may be right, at least in the case of correlated features. However, I wonder - is the only difference between minimal optimal and all relevant feature selection that all relevant retains correlated features?

Also, since Boruta uses multiple runs of the wrapper algorithm, it seems pretty unlikely that, given correlated features A and B, it would choose A every time and never choose B (particularly when using a large number of runs of the algorithm). I suppose this also assumes a different random seed for each run of the underlying algorithm, but I am not sure about that.

I guess the real question here is, will there be any utility in using something other than Random Forest for the Boruta wrapper? Will I get a different/better set of features using XGBoost, decision tree, etc. than using Random Forest? My thought was, since I am using XGBoost for my final model, it would be better to use the same algorithm with the same parameters for my feature selection, but maybe this isn't the case. Do either of you have any thoughts on this?

gaw89 avatar gaw89 commented on July 25, 2024

Cool algorithm/package by the way!

mbq avatar mbq commented on July 25, 2024

Thanks!

This correlated features issue is a bit more complicated; imagine a set with features A, B and C such that both A and f(B,C) explain Y perfectly, and f is non-trivial in the sense that neither B nor C on its own is a good predictor. Now, no greedy CART-based method will ever touch B or C, regardless of random seed; however, after removing A, it will happily build a 100% accurate model on them.
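This can be illustrated with a tiny simulation (pure numpy, using XOR as a stand-in for the non-trivial f(B,C); the variable names just mirror the example above):

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.integers(0, 2, 1000)
C = rng.integers(0, 2, 1000)
y = B ^ C      # f(B, C) = XOR: y is fully determined by B and C jointly
A = y.copy()   # A alone also explains y perfectly

# Marginally, A is a perfect predictor while B and C look like pure noise,
# so a greedy single-split criterion sees no gain on B or C while A exists.
print(np.corrcoef(A, y)[0, 1])       # ~1.0
print(abs(np.corrcoef(B, y)[0, 1]))  # close to 0
print(abs(np.corrcoef(C, y)[0, 1]))  # close to 0
```

A greedy tree therefore keeps splitting on A and never discovers B or C, even though the all-relevant set is {A, B, C}.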

One may ask who needs B or C when A does the job? The standard answer is that it allows for a better understanding of the problem, which is useless if you only need a good model. The less standard answer is that it leads to more robust models (A might be clean in train, but may get noisier in test); also, in p>>n settings, spurious random correlations may easily become indistinguishable from real ones, so AR selection is better for designing further studies / doing meta-analyses (A may be a lucky piece of nonsense). Formally, it is even crazier, because there is no minimal optimal selection when there is a perfect duplicate of information, but I'll leave that for now.

Back to your main question. Boruta is mostly for drawing a line between weakly relevant features and noise; it basically assumes that the importance source scans the feature space more or less homogeneously. Greedy methods like canonical boosting go in the opposite direction, hence my concern that they are not the best importance sources; but I haven't tested that, especially the stochastic modifications, so I don't know for sure.

About the final model: theoretically, the AR set is not fundamental, so it shouldn't depend on the method that produced it, provided that method works well; hence using the same modelling in both places shouldn't be beneficial. What's more, again theoretically, AR is totally redundant for model optimisation, while being way more expensive than MO. The only caveat is the aforementioned robustness thing, but it should be relevant only in pathological cases anyway.
