Comments (11)
Sorry to interrupt, but this seems like a pretty bad idea. The point of Boruta is that it is an all-relevant method, so it should be optimised for robust selection rather than for the lowest post-selection error (which is what GridSearch optimises). I am afraid that adding such a pursuit would simply degenerate the method into an incredibly inefficient random sampler.
from boruta_py.
To solve this, one needs to implement a .get_params() method.
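For context, a minimal sketch of what the scikit-learn convention requires. This is illustrative only: the class name, attributes, and parameters here are hypothetical stand-ins, not the actual boruta_py implementation.

```python
# Hypothetical sketch of the get_params/set_params contract GridSearchCV
# relies on. Names (BorutaLike, perc, etc.) are illustrative, not boruta_py's.
class BorutaLike:
    def __init__(self, estimator=None, n_estimators=1000, perc=100):
        # scikit-learn convention: __init__ only stores its arguments verbatim
        self.estimator = estimator
        self.n_estimators = n_estimators
        self.perc = perc

    def get_params(self, deep=True):
        # GridSearchCV calls this to clone the estimator and to learn
        # which parameters it is allowed to tune.
        return {
            "estimator": self.estimator,
            "n_estimators": self.n_estimators,
            "perc": self.perc,
        }

    def set_params(self, **params):
        # The counterpart: GridSearchCV applies each candidate parameter
        # combination through this before fitting.
        for key, value in params.items():
            setattr(self, key, value)
        return self

selector = BorutaLike(n_estimators=500)
print(selector.get_params()["n_estimators"])  # 500
```

In practice one would simply inherit from sklearn.base.BaseEstimator, which provides both methods for free as long as `__init__` stores its arguments unchanged.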
If you submit a PR for this I'm more than happy to accept it, but unfortunately I'll be very busy in the coming months. Cheers
Okay, let's have a deal: I'll submit a PR for this and you upload it to PyPI? ;)
Alright, deal :) (but only in the 2nd or 3rd week of Jan)
Since Miron is the author of the Boruta algorithm, I'll trust him on this one. Unless you can convince him @MaxBenChrist :)
First, merry Christmas to you all! :)
Hi @mbq, first, nice paper + algorithm. Regarding your doubts about using Boruta in a Grid Search:
Let's say we have a simple pipeline: first Boruta, then a classifier C. A GridSearch over many different folds optimizes the parameters of Boruta for the best performance of classifier C. Now you fear that Boruta will become a random sampler, and I am not sure why.
How else should one determine Boruta's parameters (e.g. the max_depth or number of estimators of its random forest classifier) if not by the final classifier performance? On real-world data sets we don't know which features are relevant and which are not. This is a general problem: if I have a pipeline with a feature selection algorithm, I have to optimize all the parameters, including those of the feature selection step, at the same time. I can't optimize the parameters of the feature selection step alone because there is no score/loss function for it.
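The setup being debated can be sketched with plain scikit-learn, using SelectFromModel as a stand-in for Boruta (which, per this thread, lacks get_params at this point). The parameter grids are illustrative, not a recommendation.

```python
# Sketch of the pipeline described above: a feature selector followed by a
# classifier, with GridSearchCV tuning the *selector's* parameters on the
# downstream classifier's cross-validated score. SelectFromModel stands in
# for Boruta; all grid values are illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=200, n_features=20,
                           n_informative=5, random_state=0)

pipe = Pipeline([
    ("select", SelectFromModel(RandomForestClassifier(random_state=0))),
    ("clf", RandomForestClassifier(random_state=0)),
])

# The "step__param" syntax reaches into nested estimators; this is exactly
# what requires get_params/set_params on every component.
grid = GridSearchCV(
    pipe,
    param_grid={
        "select__estimator__n_estimators": [10, 50],
        "clf__n_estimators": [50],
    },
    cv=3,
)
grid.fit(X, y)
print(grid.best_params_)
```

This is the mechanism in question; whether tuning the selection step this way is a good idea is precisely what the rest of the thread argues about.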
Happy holidays!
The problem here is that you assume the classification error is minimal on the relevant set of features, which is false in general (because of redundant features, classifier characteristics, noise, overfitting, etc.). That's why there are two classes of FS methods, "minimal optimal" and "all relevant" (there is a nice paper about this, with Bayesian network definitions).
It is obviously true that optimising an all-relevant method is basically next to hopeless. Still, you may aim at robustness (like stability of the selection under perturbations of the input set; though this is only an upper bound, as there may be a perfectly stable method which selects junk), use some domain knowledge, or pick generally robust methods and hope for the best. The latter is what Boruta does: for RF classification max_depth is infinity by design (sic, otherwise this is just some random CART ensemble), the default m is rarely significantly suboptimal, and finally RF is expected to converge with the number of trees, so overshooting n only costs time.
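The convergence point can be illustrated with plain scikit-learn (not Boruta itself): fully grown trees, the default m, and an increasing number of trees, where adding more trees mainly costs time. The dataset and numbers here are purely illustrative.

```python
# Illustration of the argument above, using a plain RF rather than Boruta:
# with fully grown trees and the default m, the cross-validated score
# stabilises as n_estimators grows, so overshooting n mostly wastes time.
# Dataset and tree counts are illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=10,
                           n_informative=4, random_state=0)

scores = {}
for n in (10, 100, 300):
    rf = RandomForestClassifier(
        n_estimators=n,
        max_depth=None,       # fully grown trees, as Boruta assumes by design
        max_features="sqrt",  # the default m, rarely far from optimal
        random_state=0,
    )
    scores[n] = cross_val_score(rf, X, y, cv=3).mean()

print(scores)
```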
@mbq The paper is great. I studied it in detail and learned a lot. Thank you for that!
The problem here is that you assume the classification error is minimal on the relevant set of features, which is false in general (because of redundant features, classifier characteristics, noise, overfitting, etc.).
Actually, I deploy complex machine learning pipelines that combine "all relevant" methods with "minimal optimal" ones or heavily regularized classifiers. I create a huge number of features and then use multiple layers of filtering/regularization/feature selection.
When you say that
optimising an all relevant method is basically next to hopeless.
Are you referring to Corollary 14 from the Nilsson paper here? ("The all-relevant problem requires exhaustive subset search.")
Even then, I think your pipeline would benefit more from Boruta with some sane default params as a filter of irrelevant attributes than from Boruta with parameters tuned to yield the best accuracy (because I think it would mostly degenerate Boruta into returning the few most obvious features or even pure noise); but I may be wrong, with such a meta-meta approach anything is possible.
Are you referring to Corollary 14 from the Nilsson paper here? ("The all-relevant problem requires exhaustive subset search.")
Well, no, rather that all-relevant selection does not optimise error, thus it is hard to assess how all-relevant some selection really is. Also, Nilsson et al. consider the asymptotic, perfect case where you have near-perfect conditional probability estimates -- the whole Boruta mess is motivated by the fact that this is really hard to achieve in problems that need feature selection.
So what is your final position on the .get_params() method? :) Will you merge a PR containing such a method (triggering a warning about the all-relevant issues when used)?