Comments (11)

mbq commented on July 25, 2024

Sorry to interrupt, but this seems like a pretty bad idea. The point of Boruta is that it is an all-relevant method, so it should be optimised for robust selection rather than for the lowest post-selection error (which is what GridSearch does). I am afraid that adding such a pursuit will simply degenerate the method into an incredibly inefficient random sampler.

from boruta_py.

MaxBenChrist commented on July 25, 2024

To solve this, one needs to implement a .get_params() method.

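The contract MaxBenChrist refers to can be sketched as follows. This is a hypothetical illustration of the scikit-learn estimator convention that GridSearchCV relies on, not BorutaPy's actual code; the constructor arguments merely mirror some of BorutaPy's parameter names.

```python
# Hypothetical sketch of the get_params()/set_params() contract that
# grid search needs. Parameter names mimic BorutaPy's but this class
# is an illustration, not the library's implementation.

class BorutaLike:
    def __init__(self, n_estimators=1000, perc=100, alpha=0.05):
        # scikit-learn convention: __init__ only stores the parameters
        self.n_estimators = n_estimators
        self.perc = perc
        self.alpha = alpha

    def get_params(self, deep=True):
        # Expose every constructor parameter so grid search can clone
        # the estimator and enumerate candidate settings
        return {"n_estimators": self.n_estimators,
                "perc": self.perc,
                "alpha": self.alpha}

    def set_params(self, **params):
        # Grid search calls this to apply each candidate setting
        for key, value in params.items():
            setattr(self, key, value)
        return self
```

With these two methods in place (or by subclassing sklearn.base.BaseEstimator, which derives them from the `__init__` signature automatically), an estimator becomes usable inside Pipeline and GridSearchCV.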
danielhomola commented on July 25, 2024

If you submit a PR for this I'm more than happy to accept it, but I'll be very busy in the coming months, unfortunately. Cheers

MaxBenChrist commented on July 25, 2024

Okay, let's have a deal: I will submit a PR for this and you upload it to PyPI? ;)

danielhomola commented on July 25, 2024

Alright, deal :) (but only in the 2nd or 3rd week of Jan)

danielhomola commented on July 25, 2024

Since Miron is the author of the Boruta algorithm, I'll trust him on this one. Unless you can convince him, @MaxBenChrist :)

MaxBenChrist commented on July 25, 2024

First, merry Christmas to you all! :)

Hi @mbq, first, nice paper + algorithm. Regarding your doubts about using Boruta in a grid search:

Let's say we have a simple pipeline: first Boruta, then a classifier C. A grid search over many different folds optimizes the parameters of Boruta for the best performance of the classifier C. Now you fear that Boruta will become a random sampler. I am not sure why.

How else should one determine the parameters of Boruta (e.g. the max_depth or number of estimators of the random forest) if not on the final classifier performance? On real-world data sets, we don't know which features are relevant and which are not. This is a general problem: if I have a pipeline with a feature selection algorithm, I have to optimize all the parameters, including those of the feature selection step, at the same time. I can't optimize the parameters of the feature selection step alone because there is no score / loss function for it.

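The setup MaxBenChrist describes can be sketched with scikit-learn. Since BorutaPy did not yet expose get_params(), SelectKBest stands in for the Boruta step here; once get_params()/set_params() exist, BorutaPy could be dropped into the same pipeline slot. All names and parameter values below are illustrative.

```python
# Sketch of the pipeline under discussion: feature selection followed
# by a classifier C, with both steps tuned jointly by GridSearchCV.
# SelectKBest is a stand-in for Boruta.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=200, n_features=20,
                           n_informative=5, random_state=0)

pipe = Pipeline([
    ("select", SelectKBest(f_classif)),           # Boruta would go here
    ("clf", RandomForestClassifier(random_state=0)),
])

# Grid search addresses each step's parameters as "<step>__<param>";
# this addressing is exactly what get_params()/set_params() enable.
grid = GridSearchCV(pipe,
                    {"select__k": [5, 10],
                     "clf__n_estimators": [25, 50]},
                    cv=3)
grid.fit(X, y)
print(grid.best_params_)
```

This is the mechanical reason a missing get_params() blocks grid search: without it, GridSearchCV cannot clone the selector or enumerate its candidate settings.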
mbq commented on July 25, 2024

Happy holidays!

The problem here is that you assume the classification error is minimal on the relevant set of features, which is false in general (because of redundant features, classifier characteristics, noise, overfitting, etc.). That's why there are two classes of FS methods, "minimal optimal" and "all relevant" (there is a nice paper about this, with Bayesian network definitions).

It is obviously true that optimising an all-relevant method is basically next to hopeless. Still, you may aim at robustness (like stability of the selection under perturbations of the input set; though it is only an upper bound, as there may be a perfectly stable method which selects junk), use some domain knowledge, or take some generally robust method and hope for the best. The latter is what Boruta does: for RF classification, max_depth is infinity by design (sic; otherwise this is just some random CART ensemble), the default m is rarely significantly suboptimal, and finally RF is expected to converge with the number of trees, so overshooting n only costs time.

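The robustness criterion mbq mentions (stability of the selection under perturbations of the input set) can be sketched like this. SelectKBest again stands in for Boruta, and the mean pairwise Jaccard similarity is one common stability measure, chosen for illustration rather than taken from the paper.

```python
# Sketch: run a feature selector on bootstrap copies of the data and
# measure how stable the selected feature set is across runs.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=300, n_features=15,
                           n_informative=4, random_state=0)

def selected_set(X, y, k=4):
    # Return the indices of the features the selector keeps
    sel = SelectKBest(f_classif, k=k).fit(X, y)
    return frozenset(np.flatnonzero(sel.get_support()))

sets = []
for _ in range(10):
    idx = rng.integers(0, len(y), len(y))   # bootstrap resample
    sets.append(selected_set(X[idx], y[idx]))

# Mean pairwise Jaccard similarity: 1.0 means a perfectly stable
# selection; note mbq's caveat that a stable method may still be
# stably selecting junk, so this is only an upper bound.
pairs = [(a, b) for i, a in enumerate(sets) for b in sets[i + 1:]]
stability = np.mean([len(a & b) / len(a | b) for a, b in pairs])
print(round(float(stability), 2))
```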
MaxBenChrist commented on July 25, 2024

@mbq The paper is great. I studied it in detail and learned a lot. Thank you for that!

The problem here is that you assume the classification error is minimal on the relevant set of features, which is false in general (because of redundant features, classifier characteristics, noise, overfitting, etc.).

Actually, I deploy complex machine learning pipelines that contain both "all relevant" and "minimal optimal" methods, together with heavily regularized classifiers. I create a huge number of features and then use multiple layers of filtering/regularization/feature selection.

When you say that

optimising an all relevant method is basically next to hopeless.

Are you referring to Corollary 14 from the Nilsson paper here? ( The all-relevant problem requires exhaustive subset search. )

mbq commented on July 25, 2024

Even then, I think your pipeline would benefit more from Boruta with some sane default params as a filter of irrelevant attributes than from Boruta with parameters tuned to yield the best accuracy (because I think it would mostly degenerate Boruta into returning the few most obvious features, or even pure noise); but I may be wrong, with such a meta-meta approach anything is possible.

Are you referring to Corollary 14 from the Nilsson paper here? ( The all-relevant problem requires exhaustive subset search. )

Well, no; rather that an all-relevant method does not optimise error, so it is hard to assess how all-relevant some selection really is. Also, Nilsson et al. consider the asymptotic, perfect case where you have near-perfect conditional probability estimates -- the whole Boruta mess is motivated by the fact that this is really hard to achieve in problems that need feature selection.

MaxBenChrist commented on July 25, 2024

So what is your final position on the .get_params() method? :) Will you merge a PR containing such a method (triggering a warning about the all-relevant issues when used)?
