Comments (11)
Sorry to interrupt, but this seems like a pretty bad idea. The point of Boruta is that it is an all-relevant method, so it should be optimised for robust selection rather than for the lowest post-selection error (which is what GridSearch optimises). I am afraid that adding such a pursuit would simply degenerate the method into an incredibly inefficient random sampler.
from boruta_py.
To solve this, one needs to implement a .get_params() method.
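For context, a minimal sketch of what the scikit-learn convention requires. This is illustrative only: the class name, attributes, and parameters here are hypothetical stand-ins, not the actual boruta_py implementation.

```python
# Hypothetical sketch of the get_params/set_params contract GridSearchCV
# relies on. Names (BorutaLike, perc, etc.) are illustrative, not boruta_py's.
class BorutaLike:
    def __init__(self, estimator=None, n_estimators=1000, perc=100):
        # scikit-learn convention: __init__ only stores its arguments verbatim
        self.estimator = estimator
        self.n_estimators = n_estimators
        self.perc = perc

    def get_params(self, deep=True):
        # GridSearchCV calls this to clone the estimator and to learn
        # which parameters it is allowed to tune.
        return {
            "estimator": self.estimator,
            "n_estimators": self.n_estimators,
            "perc": self.perc,
        }

    def set_params(self, **params):
        # The counterpart: GridSearchCV applies each candidate parameter
        # combination through this before fitting.
        for key, value in params.items():
            setattr(self, key, value)
        return self

selector = BorutaLike(n_estimators=500)
print(selector.get_params()["n_estimators"])  # 500
```

In practice one would simply inherit from sklearn.base.BaseEstimator, which provides both methods for free as long as `__init__` stores its arguments unchanged.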
If you submit a PR for this I'm more than happy to accept it, but unfortunately I'll be very busy in the coming months. Cheers
Okay, let's have a deal: I'll submit a PR for this and you upload it to PyPI? ;)
Alright, deal :) (but only in the 2nd or 3rd week of Jan)
Since Miron is the author of the Boruta algorithm, I'll trust him on this one. Unless you can convince him @MaxBenChrist :)
First, merry Christmas to you all! :)
Hi @mbq, first, nice paper + algorithm. Regarding your doubts about using Boruta in a Grid Search:
Let's say we have a simple pipeline: first Boruta, then a classifier C. A GridSearch over many different folds optimizes the parameters of Boruta for the best performance of classifier C. Now you fear that Boruta will become a random sampler, and I am not sure why.
How else should one determine Boruta's parameters (e.g. the max_depth or number of estimators of its random forest classifier) if not by the final classifier performance? On real-world data sets we don't know which features are relevant and which are not. This is a general problem: if I have a pipeline with a feature selection algorithm, I have to optimize all the parameters, including those of the feature selection step, at the same time. I can't optimize the parameters of the feature selection step alone because there is no score/loss function for it.
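The setup being debated can be sketched with plain scikit-learn, using SelectFromModel as a stand-in for Boruta (which, per this thread, lacks get_params at this point). The parameter grids are illustrative, not a recommendation.

```python
# Sketch of the pipeline described above: a feature selector followed by a
# classifier, with GridSearchCV tuning the *selector's* parameters on the
# downstream classifier's cross-validated score. SelectFromModel stands in
# for Boruta; all grid values are illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=200, n_features=20,
                           n_informative=5, random_state=0)

pipe = Pipeline([
    ("select", SelectFromModel(RandomForestClassifier(random_state=0))),
    ("clf", RandomForestClassifier(random_state=0)),
])

# The "step__param" syntax reaches into nested estimators; this is exactly
# what requires get_params/set_params on every component.
grid = GridSearchCV(
    pipe,
    param_grid={
        "select__estimator__n_estimators": [10, 50],
        "clf__n_estimators": [50],
    },
    cv=3,
)
grid.fit(X, y)
print(grid.best_params_)
```

This is the mechanism in question; whether tuning the selection step this way is a good idea is precisely what the rest of the thread argues about.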
Happy holidays!
The problem here is that you assume the classification error is minimal on the relevant set of features, which is false in general (because of redundant features, classifier characteristics, noise, overfitting, etc.). That's why there are two classes of FS methods, "minimal optimal" and "all relevant" (there is a nice paper about this, with Bayesian network definitions).
It is obviously true that optimising an all-relevant method is basically next to hopeless. Still, you may aim at robustness (like stability of the selection under perturbations of the input set; though this is only an upper bound, as there may be a perfectly stable method which selects junk), use some domain knowledge, or pick generally robust methods and hope for the best. The latter is what Boruta does: for RF classification max_depth is infinity by design (sic, otherwise this is just some random CART ensemble), the default m is rarely significantly suboptimal, and finally RF is expected to converge with the number of trees, so overshooting n only costs time.
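The convergence point can be illustrated with plain scikit-learn (not Boruta itself): fully grown trees, the default m, and an increasing number of trees, where adding more trees mainly costs time. The dataset and numbers here are purely illustrative.

```python
# Illustration of the argument above, using a plain RF rather than Boruta:
# with fully grown trees and the default m, the cross-validated score
# stabilises as n_estimators grows, so overshooting n mostly wastes time.
# Dataset and tree counts are illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=10,
                           n_informative=4, random_state=0)

scores = {}
for n in (10, 100, 300):
    rf = RandomForestClassifier(
        n_estimators=n,
        max_depth=None,       # fully grown trees, as Boruta assumes by design
        max_features="sqrt",  # the default m, rarely far from optimal
        random_state=0,
    )
    scores[n] = cross_val_score(rf, X, y, cv=3).mean()

print(scores)
```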
@mbq The paper is great. I studied it in detail and learned a lot. Thank you for that!
The problem here is that you assume the classification error is minimal on the relevant set of features, which is false in general (because of redundant features, classifier characteristics, noise, overfitting, etc.).
Actually, I deploy complex machine learning pipelines that combine "all relevant" methods with "minimal optimal" ones or heavily regularized classifiers. I create a huge number of features and then use multiple layers of filtering/regularization/feature selection.
When you say that
optimising an all relevant method is basically next to hopeless.
Are you referring to Corollary 14 from the Nilsson paper here? ("The all-relevant problem requires exhaustive subset search.")
Even then, I think your pipeline would benefit more from Boruta with some sane default params as a filter of irrelevant attributes than from Boruta with parameters tuned to yield the best accuracy (because I think it would mostly degenerate Boruta into returning the few most obvious features or even pure noise); but I may be wrong, with such a meta-meta approach anything is possible.
Are you referring to Corollary 14 from the Nilsson paper here? ("The all-relevant problem requires exhaustive subset search.")
Well, no, rather that all-relevant selection does not optimise error, thus it is hard to assess how all-relevant some selection really is. Also, Nilsson et al. consider the asymptotic, perfect case where you have near-perfect conditional probability estimates -- the whole Boruta mess is motivated by the fact that this is really hard to achieve in problems that need feature selection.
So what is your final position on the .get_params() method? :) Will you merge a PR containing such a method (triggering a warning about the all-relevant issues when used)?