Comments (8)
Hi,
BorutaPy now works with all tree-based methods in scikit-learn, so it'll work with GradientBoostingClassifier, which is pretty close to XGBoost. XGBoost is super fast though, so adding it would be amazing.
I think there's a scikit-learn-like interface for XGBoost that has the fit method, which is great. The only other thing you need to use BorutaPy with XGBoost is a way to extract variable importances from the model; in BorutaPy this is done through scikit-learn's feature_importances_ property. Unfortunately, the authors of the scikit-learn-like interface of XGBoost did not comply with this and instead use a get_score() method to return the variable importances.
So I guess the easiest thing to do would be to take the scikit-learn-like interface of XGBoost and extend the class with a feature_importances_ property (which would just call get_score() under the hood).
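A minimal sketch of what such a wrapper might look like. The class and its details here are illustrative, not BorutaPy's or XGBoost's actual API (and newer xgboost releases may already expose feature_importances_ themselves); the key point is just translating a get_score()-style dict into the dense array scikit-learn tools expect:

```python
import numpy as np

class BorutaCompatibleXGB:
    """Hypothetical wrapper: adapts an estimator with a get_score()-style
    importance API to the feature_importances_ property BorutaPy expects."""

    def __init__(self, model, n_features):
        self.model = model            # estimator exposing fit() and get_score()
        self.n_features = n_features  # total number of features in the data

    def fit(self, X, y):
        self.model.fit(X, y)
        return self

    def get_params(self, deep=True):
        return self.model.get_params(deep=deep)

    def set_params(self, **params):
        self.model.set_params(**params)
        return self

    @property
    def feature_importances_(self):
        # get_score() returns a dict like {'f0': 12.0, 'f3': 4.0}; features
        # never used in a split are absent from it, so expand the dict into
        # a dense, normalised array of length n_features.
        scores = self.model.get_score()
        imp = np.zeros(self.n_features)
        for name, value in scores.items():
            imp[int(name.lstrip('f'))] = value
        total = imp.sum()
        return imp / total if total > 0 else imp
```

BorutaPy could then be handed a `BorutaCompatibleXGB` instance wherever it expects a scikit-learn estimator, since all it reads back is `feature_importances_`.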
I don't want to add this to BorutaPy itself because it's supposed to support scikit-learn and not other packages. Nonetheless, if you find a way to do this, let me know and maybe I'll include it in the README as an example.
from boruta_py.
Hello Daniel,
Thanks for your response. I looked into this a little more, and it appears (as you mentioned) that the issue lies with XGBoost's implementation of the sklearn API. I made some alterations and got it working with Boruta. I am working with the XGBoost maintainers to see if they'll accept a PR for the updates to the API.
I will post back here if/when the PR gets through.
Great package by the way!
Sweet, thanks!
@gaw89 Out of curiosity, does this even work? I would expect XGBoost scores to just degenerate Boruta into a minimal optimal method...
Hm, interesting point Miron. Is that because of this?
@mbq if you are referring to the link that Daniel posted, it seems you may be right, at least in the case of correlated features. However, I wonder: is the only difference between minimal-optimal and all-relevant feature selection that all-relevant retains correlated features?
Also, since Boruta uses multiple runs of the wrapper algorithm, it seems pretty unlikely that, given correlated features A and B, it would choose A every time and never choose B (particularly when using a large number of runs of the algorithm). I suppose this also assumes a different random seed for each run of the underlying algorithm, but I am not sure about that.
I guess the real question here is: is there any utility in using something other than Random Forest for the Boruta wrapper? Will I get a different/better set of features using XGBoost, a decision tree, etc. than using Random Forest? My thought was that, since I am using XGBoost for my final model, it would be better to use the same algorithm with the same parameters for my feature selection, but maybe that isn't the case. Do either of you have any thoughts on this?
Cool algorithm/package by the way!
Thanks!
This correlated-features issue is a bit more complicated; imagine a data set with features A, B and C such that both A and f(B,C) explain Y perfectly, and f is non-trivial in the sense that neither B nor C on its own is a good predictor. Now, no greedy CART-based method will ever touch B or C, regardless of the random seed; however, after removing A, it will happily build a 100% accurate model on them.
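That scenario is easy to simulate: take Y = XOR(B, C) with A duplicating Y, so A alone is a perfect predictor while B and C carry no individual signal. A quick numpy-only sketch (the helper below stands in for a single greedy CART split; it is an illustration, not BorutaPy code):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# B and C are independent coin flips; Y = XOR(B, C); A duplicates Y.
B = rng.integers(0, 2, n)
C = rng.integers(0, 2, n)
Y = B ^ C
A = Y.copy()

def best_single_split_accuracy(x, y):
    """Accuracy of a one-feature greedy split on a binary x:
    predict the majority class on each side of the split."""
    left, right = y[x == 0], y[x == 1]
    correct = max(left.sum(), len(left) - left.sum()) + \
              max(right.sum(), len(right) - right.sum())
    return correct / len(y)

acc_A = best_single_split_accuracy(A, Y)  # 1.0: A alone is perfect
acc_B = best_single_split_accuracy(B, Y)  # ~0.5: B alone looks like noise
acc_C = best_single_split_accuracy(C, Y)  # ~0.5: C alone looks like noise
print(acc_A, acc_B, acc_C)
```

A greedy tree therefore always splits on A first and never rewards B or C with importance; drop A, and a two-level tree on B and C alone reconstructs Y exactly.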
One may ask who needs B or C when A does the job? The standard answer is that it allows for a better understanding of the problem, which is useless if you only need a good model. The less standard one is that it leads to more robust models (A might be clean in train but may get noisier in test); also, in p>>n settings, spurious random correlations may easily become indistinguishable from real ones, so AR selection is better for designing further studies / doing meta-analyses (A may be a lucky piece of nonsense). Formally, it is even crazier, because there is no minimal-optimal selection when there is a perfect duplicate of information, but I'll leave that for now.
Back to your main question. Boruta is mostly for drawing a line between weakly relevant features and noise; it basically assumes that the importance source somehow homogeneously scans the feature space. Greedy methods like canonical boosting go in the opposite direction, hence my concern that they are not the best importance sources; but I haven't tested that, especially the stochastic modifications, so I don't know for sure.
About the final model: theoretically, the AR set is not fundamental, so it won't depend on the method that produced it, provided that method works well, so using the same modelling in both places shouldn't be beneficial. What's more, again theoretically, AR is totally redundant for model optimisation, while being way more expensive than MO. The only caveat is the aforementioned robustness, but that should be relevant only in pathological cases anyway.