boruta_py's Introduction

scikit-learn-contrib

scikit-learn-contrib is a GitHub organization for gathering high-quality scikit-learn-compatible projects. It also provides a template for establishing new scikit-learn-compatible projects.

Vision

With the explosion of the number of machine learning papers, it becomes increasingly difficult for users and researchers to implement and compare algorithms. Even when authors release their software, it takes time to learn how to use it and how to apply it to one's own purposes. The goal of scikit-learn-contrib is to provide easy-to-install and easy-to-use high-quality machine learning software. With scikit-learn-contrib, users can install a project by pip install sklearn-contrib-project-name and immediately try it on their data with the usual fit, predict and transform methods. In addition, projects are compatible with scikit-learn tools such as grid search, pipelines, etc.
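In practice, "scikit-learn compatible" means implementing the standard estimator contract: fit/transform (or fit/predict) plus get_params/set_params, usually inherited from BaseEstimator. A minimal sketch with a made-up transformer standing in for a contrib project (DemoScaler is invented for illustration, not a real package):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

class DemoScaler(BaseEstimator, TransformerMixin):
    """Toy stand-in for a contrib transformer: centers each column."""
    def fit(self, X, y=None):
        self.mean_ = np.asarray(X).mean(axis=0)
        return self

    def transform(self, X):
        return np.asarray(X) - self.mean_

# Because it follows the fit/transform contract, it drops straight into a Pipeline
X = np.array([[0.0, 1.0], [2.0, 3.0], [4.0, 5.0], [6.0, 7.0]])
y = np.array([0, 0, 1, 1])
pipe = Pipeline([("scale", DemoScaler()), ("clf", LogisticRegression())])
pipe.fit(X, y)
preds = pipe.predict(X)
```

Any estimator built this way also works with grid search and the other scikit-learn tools mentioned above.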

Projects

If you would like to include your own project in scikit-learn-contrib, take a look at the workflow.

A simple but efficient density-based clustering algorithm that can find clusters of arbitrary size, shape and density in two dimensions. Higher-dimensional data are first reduced to 2-D using t-SNE. The algorithm relies on a single parameter, K, the number of nearest neighbors.

Read The Docs, Read the Paper

Maintained by Mohamed Abbas.

Large-scale linear classification, regression and ranking.

Maintained by Mathieu Blondel and Fabian Pedregosa.

Fast and modular Generalized Linear Models with support for models missing in scikit-learn.

Maintained by Mathurin Massias, Pierre-Antoine Bannier, Quentin Klopfenstein and Quentin Bertrand.

A Python implementation of Jerome Friedman's Multivariate Adaptive Regression Splines.

Maintained by Jason Rudy and Mehdi.

Python module to perform under sampling and over sampling with various techniques.

Maintained by Guillaume Lemaitre, Fernando Nogueira, Dayvid Oliveira and Christos Aridas.

Factorization machines and polynomial networks for classification and regression in Python.

Maintained by Vlad Niculae.

Confidence intervals for scikit-learn forest algorithms.

Maintained by Ariel Rokem, Kivan Polimis and Bryna Hazelton.

A high performance implementation of HDBSCAN clustering.

Maintained by Leland McInnes, jc-healy, c-north and Steve Astels.

A library of sklearn compatible categorical variable encoders.

Maintained by Will McGinnis and Paul Westenthanner.

Python implementations of the Boruta all-relevant feature selection method.

Maintained by Daniel Homola.

Pandas integration with sklearn.

Maintained by Israel Saeta Pérez.

Machine learning with logical rules in Python.

Maintained by Florian Gardin, Ronan Gautier, Nicolas Goix and Jean-Matthieu Schertzer.

A Python implementation of the stability selection feature selection algorithm.

Maintained by Thomas Huijskens.

Metric learning algorithms in Python.

Maintained by CJ Carey, Yuan Tang, William de Vazelhes, Aurélien Bellet and Nathalie Vauquier.

boruta_py's Issues

Bug when doing feature selection

I got the following exception on the completion of feature selection by Boruta:

Iteration: 34 / 100
Confirmed: 10
Tentative: 0
Rejected: 432

IndexError Traceback (most recent call last)
in ()

E:\Anaconda3\lib\site-packages\boruta\boruta_py.py in fit(self, X, y)
199 """
200
--> 201 return self._fit(X, y)
202
203 def transform(self, X, weak=False):

E:\Anaconda3\lib\site-packages\boruta\boruta_py.py in _fit(self, X, y)
334
335 # update rank for not_selected features
--> 336 if not_selected.shape[0] > 0 and not_selected.shape[1] > 0:
337 # calculate ranks in each iteration, then median of ranks across feats
338 iter_ranks = self._nanrankdata(imp_history_rejected, axis=1)

IndexError: tuple index out of range

The issue seems to be caused by 52d504b

and could be fixed by merging the following commit from guitarmind@f68cfcd
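The failure mode is easy to reproduce in isolation: when no feature is rejected, the array of not-selected features is empty and one-dimensional, so indexing shape[1] raises IndexError. A small sketch of the shape problem and a guard (illustrative, not the merged fix):

```python
import numpy as np

not_selected = np.array([])   # what remains when no feature was rejected

# shape is (0,), so not_selected.shape[1] would raise IndexError;
# checking ndim first short-circuits safely
safe = not_selected.ndim == 2 and not_selected.shape[0] > 0 and not_selected.shape[1] > 0
```

With the guard, the rank-update branch is simply skipped when nothing was rejected.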

"No module names Boruta"

i cloned repositery.. pasted in my anaconda/Lib/site-packages

still its not working. Its showing "No module named boruta_py"

I correctly installed this package in anaconda/Lib/site-packages
but still its giving me the error "No module named boruta_py". I checked boruta is installed in there but still same error
I restarted my spyder idle several times

New release

@danielhomola Hi! There have been quite a few bug fixes since 0.1.5. Do you mind making a new release and putting it on pypi.org?

Results vary between the R and Python implementations

It is mentioned that "the two_step parameter has to be set to False, then (with perc=100) BorutaPy behaves exactly as the R version."
In spite of doing this, results vary significantly. Is there a way to replicate the results of the R version of the package exactly?

Iterating over 0-d array

When running Boruta, if all features get selected, the code raises an error when _nanrankdata is called in BorutaPy.


Tuple Index Out of range

Hi, my working environment is Python 3.6.4 and boruta is the current version. When I ran the example dataset in the examples folder of boruta, the following error message appeared: "tuple index out of range".
The error seems to happen on this line:
"if not_selected.shape[0] > 0 and not_selected.shape[1] > 0:"

Better test coverage

We could add some unit tests, perhaps some edge cases with obviously irrelevant features.

Question: Feature Selection for Regression Problems

The examples provided apply Boruta for feature selection in classification problems. Can Boruta be accurately applied for feature selection in regression problems? If so, what regression estimator would be most appropriate? (i.e. RandomForestRegressor, GradientBoostingRegressor, etc.)
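BorutaPy only needs an estimator that exposes feature_importances_ after fitting, so for regression a RandomForestRegressor (or GradientBoostingRegressor) is a natural choice. A minimal sketch of the estimator side on synthetic data (the dataset is made up for illustration; the fitted rf could then be passed to BorutaPy as in the classification examples):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 5))
# only the first two columns actually drive the target
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

rf = RandomForestRegressor(n_estimators=50, max_depth=5, random_state=0)
rf.fit(X, y)
importances = rf.feature_importances_  # what Boruta compares against shadow features
```

The shallow max_depth follows the pruned-trees recommendation discussed further down.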

Categorical features

Hello,

More of a question than an issue, but how do you handle categorical features?

Also, it seems like y needs to be an int? If I leave it as a float, I get an error.

Thanks!
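Both points usually come down to preprocessing: Boruta expects a numeric matrix, and for classification an integer target. A minimal numpy-only sketch (the column and values are made up for illustration):

```python
import numpy as np

colors = np.array(["red", "green", "red", "blue"])   # a categorical feature
labels = np.array([0.0, 1.0, 1.0, 0.0])              # float target that trips Boruta

# ordinal-encode the categorical column; tree models can split on the codes
categories, color_codes = np.unique(colors, return_inverse=True)

# cast the float target to int before calling feat_selector.fit(X, y)
y = labels.astype(int)

X = color_codes.reshape(-1, 1).astype(float)
```

One-hot encoding is an alternative for categoricals, at the cost of Boruta then judging each dummy column separately.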

Error when all variables are selected

When all variables are selected, this error will be thrown.


ValueError Traceback (most recent call last)
in ()
----> 1 feat_selector.fit(X,y)

/home/mike/dev/boruta_py/boruta/boruta_py2.py in fit(self, X, y)
170 """
171
--> 172 return self._fit(X, y)
173
174 def transform(self, X, weak=False):

/home/mike/dev/boruta_py/boruta/boruta_py2.py in _fit(self, X, y)
322 else:
323 # and 2 otherwise
--> 324 ranks = ranks - np.min(ranks) + 2
325 self.ranking_[not_selected] = ranks
326

/home/mike/anaconda/envs/Python2.7/lib/python2.7/site-packages/numpy/core/fromnumeric.pyc in amin(a, axis, out, keepdims)
2357 else:
2358 return _methods._amin(a, axis=axis,
-> 2359 out=out, keepdims=keepdims)
2360
2361

/home/mike/anaconda/envs/Python2.7/lib/python2.7/site-packages/numpy/core/_methods.pyc in _amin(a, axis, out, keepdims)
27
28 def _amin(a, axis=None, out=None, keepdims=False):
---> 29 return umr_minimum(a, axis, None, out, keepdims)
30
31 def _sum(a, axis=None, dtype=None, out=None, keepdims=False):

ValueError: zero-size array to reduction operation minimum which has no identity

Debug Example:

selected = [0 1 2 3]
not_selected = []
imp_history_rejected = []
iter_ranks = []
rank_medians = []
ranks = []

(I'll send you a PR with fixes for these, just documenting them here so you know what the PR is doing.)

Overwrites model's random state

It seems that boruta passes RandomState(MT19937) to the model it is fitting, regardless of the model's parameters. This doesn't bother a random forest model, but it causes an XGBoost model to fail with the following error:

ValueError: Please check your X and y variable. The provided estimator cannot be fitted to your data. Invalid Parameter format for seed expect int but value='RandomState(MT19937)'
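A possible workaround (a hedged sketch, not an official fix) is to coerce whatever random state is in play into a plain int seed before it reaches XGBoost, since older XGBoost versions only accept an int:

```python
import numpy as np
from sklearn.utils import check_random_state

def as_int_seed(random_state):
    """Coerce an int / None / np.random.RandomState into a plain int seed.
    Some estimators (e.g. older XGBoost) reject RandomState objects."""
    rs = check_random_state(random_state)
    return int(rs.randint(np.iinfo(np.int32).max))

seed = as_int_seed(np.random.RandomState(42))
# an XGBoost model constructed with seed=seed would then accept it
```

The helper name is invented; the underlying check_random_state call is the same utility scikit-learn (and boruta) use internally.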

Cannot install boruta_py

Hello,

I am using Homebrew Python on my Mac and am having trouble installing boruta_py.
Could you help?

Thanks !

Question: High Collinearity, how does Boruta handle?

Hello,

Thanks for the package, I found it quite interesting.

When there are variables that are highly correlated, could that affect the Z-scores?

I ask because in the past I have seen groups of highly correlated variables whose members varied widely in their importance.

Would it make sense to handle the collinearity problem before running Boruta?

Sincerely,
G
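Highly correlated features do tend to split impurity-based importance between them, which can lower each one's Z-score and make Boruta's treatment of a correlated group somewhat arbitrary. One common mitigation (a sketch of a generic preprocessing step, not part of Boruta) is to drop one feature of each highly correlated pair first:

```python
import numpy as np

def drop_correlated(X, threshold=0.95):
    """Return column indices to keep, dropping one of each highly correlated pair."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    keep = []
    for j in range(corr.shape[0]):
        # keep column j only if it is not too correlated with an already-kept column
        if all(corr[j, k] < threshold for k in keep):
            keep.append(j)
    return keep

rng = np.random.RandomState(0)
a = rng.normal(size=100)
# column 1 is a near-duplicate of column 0; column 2 is independent
X = np.column_stack([a, a + 1e-6 * rng.normal(size=100), rng.normal(size=100)])
kept = drop_correlated(X)
```

The threshold and the keep-first-seen policy are arbitrary choices; domain knowledge about which member of a pair to keep is usually better.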

Example Madalon_Data_Set.ipynb does not work

Dear all,
I'm getting acquainted with boruta and I have tried to execute the example Madalon_Data_Set.ipynb.
Everything works fine until

feat_selector.fit(X,y),

where I get the following error message:
`TypeError Traceback (most recent call last)
in ()
----> 1 feat_selector.fit(X,y)

/usr/local/lib/python3.5/dist-packages/boruta/boruta_py.py in fit(self, X, y)
199 """
200
--> 201 return self._fit(X, y)
202
203 def transform(self, X, weak=False):

/usr/local/lib/python3.5/dist-packages/boruta/boruta_py.py in _fit(self, X, y)
283
284 # add shadow attributes, shuffle them and train estimator, get imps
--> 285 cur_imp = self._add_shadows_get_imps(X, y, dec_reg)
286
287 # get the threshold of shadow importances we will use for rejection

/usr/local/lib/python3.5/dist-packages/boruta/boruta_py.py in _add_shadows_get_imps(self, X, y, dec_reg)
396 # find features that are tentative still
397 x_cur_ind = np.where(dec_reg >= 0)[0]
--> 398 x_cur = np.copy(X[:, x_cur_ind])
399 x_cur_w = x_cur.shape[1]
400 # deep copy the matrix for the shadow matrix

/usr/local/lib/python3.5/dist-packages/pandas/core/frame.py in __getitem__(self, key)
1962 return self._getitem_multilevel(key)
1963 else:
-> 1964 return self._getitem_column(key)
1965
1966 def _getitem_column(self, key):

/usr/local/lib/python3.5/dist-packages/pandas/core/frame.py in _getitem_column(self, key)
1969 # get column
1970 if self.columns.is_unique:
-> 1971 return self._get_item_cache(key)
1972
1973 # duplicate columns & possible reduce dimensionality

/usr/local/lib/python3.5/dist-packages/pandas/core/generic.py in _get_item_cache(self, item)
1641 """Return the cached item, item represents a label indexer."""
1642 cache = self._item_cache
-> 1643 res = cache.get(item)
1644 if res is None:
1645 values = self._data.get(item)

TypeError: unhashable type: 'slice'`

Is there a simple solution?

I'm working on a linux machine with python 3.5. My python packages are up to date.

Many thanks in advance,
Flavio

Access Z-scores of individual variables

Is it possible to access the individual Z-scores of the variables, for example to make a visualization like the one in Fig. 2 of the original paper?
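BorutaPy does not expose per-iteration importances out of the box, but the quantities behind such a plot can be recomputed by hand: append shuffled shadow copies of the features, fit the forest, and collect real vs. shadow importances over several iterations. A rough standalone sketch of that loop (synthetic data; this mimics the idea, not the library's internals):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X = rng.normal(size=(150, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # features 0 and 1 are informative

real_imps, shadow_imps = [], []
for _ in range(10):                        # several rounds give a distribution per feature
    shadows = np.apply_along_axis(rng.permutation, 0, X)  # shuffle each column
    rf = RandomForestClassifier(n_estimators=50, max_depth=5)
    rf.fit(np.hstack([X, shadows]), y)
    imp = rf.feature_importances_
    real_imps.append(imp[:X.shape[1]])
    shadow_imps.append(imp[X.shape[1]:])

real_imps = np.array(real_imps)            # shape (10, 4): boxplot these per feature
```

Boxplotting the columns of real_imps next to the pooled shadow importances reproduces the flavor of Fig. 2.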

XGBoost Support

Is BorutaPy compatible with XGBoost? If not, would you be interested in a PR for that compatibility (assuming it's possible and I can figure it out)?

It seems to me that this is not currently supported since I got an error when I tried it with XGBClassifier, but I wanted to know if there's any official word.

Thanks!

BorutaPy not found in boruta_py

Dear Dan - thanks for the code. I was having issues with the 0-d array reported elsewhere, which you had fixed, posting that the package should be cloned directly from GitHub, which I did:
!git clone https://github.com/scikit-learn-contrib/boruta_py.git
If I then attempt (per the example here: http://danielhomola.com/2015/05/08/borutapy-an-all-relevant-feature-selection-method/) to import BorutaPy from .boruta_py or boruta_py, the module doesn't exist:
ImportError: cannot import name 'boruta_py' from 'boruta_py' (unknown location)
whereas installing boruta_py via git clone was successful:

(base) C:\Users\Amin>git clone https://github.com/scikit-learn-contrib/boruta_py.git
Cloning into 'boruta_py'...
remote: Enumerating objects: 17, done.
remote: Counting objects: 100% (17/17), done.
remote: Compressing objects: 100% (10/10), done.
remote: Total 284 (delta 8), reused 16 (delta 7), pack-reused 267
Receiving objects: 100% (284/284), 146.78 KiB | 853.00 KiB/s, done.
Resolving deltas: 100% (134/134), done.

Do you have advice on how to fix this? The pip version has the 0-d array issue, and this one cannot locate boruta_py.

When using Pandas DataFrames

_add_shadows_get_imps() fails when X is a pandas DataFrame rather than a NumPy array.

A DataFrame can no longer be sliced as
x_cur = np.copy(X[:, x_cur_ind])

Possible fixes:
x_cur = np.copy(X.as_matrix()[:, x_cur_ind])
OR
x_cur = np.copy(X.ix[:, x_cur_ind])

I'd recommend testing for and casting DataFrames to NumPy arrays in _fit.
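Until such casting lands in _fit, the caller-side workaround is to hand Boruta a plain array; note that as_matrix and .ix above are deprecated in modern pandas, so np.asarray / .values are the durable spellings. A minimal sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]})  # example frame

# convert before calling feat_selector.fit(...)
X = np.asarray(df)                 # or df.values
x_cur_ind = np.array([0])
x_cur = np.copy(X[:, x_cur_ind])   # positional slicing now works
```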

_nanrankdata

Thank you for your nice implementation of boruta!
However, I am wondering why you reimplemented the nanrankdata method of bottleneck. The reason I am wondering is that the new implementation cannot handle cases where the input data is of the form

X = [..., [nan, nan, nan], ...]

whereas bottleneck.nanrankdata can (and this case occurs in my examples). I am sure there was a good reason not to use the method from bottleneck...

Thank you for your help!
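For reference, a NaN-tolerant ranking can be sketched in a few lines of numpy. This is a simplification (ties get arbitrary order rather than the average ranks bottleneck.nanrankdata assigns), but it shows the all-NaN guard the report is asking for:

```python
import numpy as np

def nanrankdata(a):
    """Rank values within each row, ignoring NaNs; all-NaN rows stay NaN."""
    a = np.asarray(a, dtype=float)
    ranks = np.full(a.shape, np.nan)
    for i, row in enumerate(a):
        mask = ~np.isnan(row)
        if mask.any():                       # skip all-NaN rows instead of failing
            order = row[mask].argsort()
            r = np.empty(mask.sum())
            r[order] = np.arange(1, mask.sum() + 1)
            ranks[i, mask] = r
    return ranks
```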

iteration over a 0-d array in `_nanrankdata`

I gather that others have hit this (#12) but it still seems like a live issue, I'm afraid. It's hitting when X and y are ndarrays of what looks like the right shape.

There's a reproducible example on Iris data here.

Plotting the result

Is it possible to create box plots for the feature ranks? Similar to the one in R?

Setup CI for automated tests

A Continuous Integration service should be configured so that the test suite runs automatically on each PR.

n_jobs = -1 does not work

Hello,

I am trying to use all my cores when running boruta_py.
I set, for instance, n_jobs=-1 on the RandomForestClassifier, but only one core does the work.
I get no errors or warnings.

can't reproduce the example

I can't reproduce your example with my dataset.
Here's the error that I'm getting:

 File "/home/imahmoudi/python/lib/python3.6/site-packages/boruta/boruta_py.py", line 380, in _get_imp
    self.estimator.fit(X, y)
  File "/home/imahmoudi/python/lib/python3.6/site-packages/sklearn/ensemble/forest.py", line 272, in fit
    y, expanded_class_weight = self._validate_y_class_weight(y)
  File "/home/imahmoudi/python/lib/python3.6/site-packages/sklearn/ensemble/forest.py", line 493, in _validate_y_class_weight
    % self.class_weight)
ValueError: Valid presets for class_weight include "balanced" and "balanced_subsample". Given "auto".

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "FSelection.py", line 31, in <module>
    feat_selector.fit(X[:5000], y[:5000])
  File "/home/imahmoudi/python/lib/python3.6/site-packages/boruta/boruta_py.py", line 201, in fit
    return self._fit(X, y)
  File "/home/imahmoudi/python/lib/python3.6/site-packages/boruta/boruta_py.py", line 285, in _fit
    cur_imp = self._add_shadows_get_imps(X, y, dec_reg)
  File "/home/imahmoudi/python/lib/python3.6/site-packages/boruta/boruta_py.py", line 408, in _add_shadows_get_imps
    imp = self._get_imp(np.hstack((x_cur, x_sha)), y)
  File "/home/imahmoudi/python/lib/python3.6/site-packages/boruta/boruta_py.py", line 383, in _get_imp
    'estimator cannot be fitted to your data.\n' + e)
TypeError: must be str, not ValueError

any idea how to solve this ?
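The first traceback holds the fix: this scikit-learn version no longer accepts class_weight='auto' and expects 'balanced' instead (the secondary TypeError only comes from boruta concatenating the exception object to a string while re-raising). A minimal sketch of the corrected estimator on synthetic data:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X = rng.normal(size=(60, 3))
y = (X[:, 0] > 0.5).astype(int)   # deliberately imbalanced classes

# 'balanced' replaces the removed 'auto' preset
rf = RandomForestClassifier(n_estimators=20, class_weight='balanced', random_state=0)
rf.fit(X, y)   # this rf would be the estimator handed to BorutaPy
```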

Pip install Botura not working

I have been trying to pip3 install Boruta, but it is not working; it returns the following error:

Collecting Boruta
Could not find a version that satisfies the requirement Botura (from versions: )
No matching distribution found for Botura

Any idea how to resolve this issue?

Thanks

Confirmed + Tentative + Rejected not always equal to number of features in final iteration?

Hi,

It seems that throughout the _print_results method, Confirmed + Tentative + Rejected is correct except for the final iteration, in which the "Tentative" value appears to be incorrect. Could this mean self.support_weak_ is not always being updated correctly?

Here is an example of the output (see the last iteration, where the values don't sum to the number of features):

Iteration: 1 / 150
Confirmed: 0
Tentative: 47
Rejected: 0

....

Iteration: 148 / 150
Confirmed: 10
Tentative: 2
Rejected: 35

Iteration: 149 / 150
Confirmed: 10
Tentative: 2
Rejected: 35

BorutaPy finished running.

Iteration: 150 / 150
Confirmed: 10
Tentative: 1
Rejected: 35

Ranking the features

Hi,
I want to get the ranking of the features according to their scores; I know a random forest can give each feature a different score. feat_selector.ranking_ outputs ranked features, but many features share the same rank, e.g. [ 1 1 1 1 1 1 1 28 7 12 16 17 2 31 32 25 30 4 3 24 28 133 386 415 426 117 493 407 185 453 202 310 199 73 74 302]. Could I ask how to further rank the features that share the same rank? Thanks.
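Confirmed features all share rank 1 by design; ranking_ only discriminates among the rest. One pragmatic way to order features within the same rank (a sketch of a workaround, not a Boruta feature) is to fall back on the fitted forest's own importances as a tie-breaker:

```python
import numpy as np

# made-up example values standing in for feat_selector.ranking_ and
# the estimator's feature_importances_ after fitting
ranking = np.array([1, 1, 1, 2, 3])
importances = np.array([0.10, 0.40, 0.25, 0.15, 0.10])

# sort by (rank ascending, importance descending): lexsort uses the last key
# as the primary key, so ranking is primary and -importances breaks ties
order = np.lexsort((-importances, ranking))
```

Keep in mind that impurity importances are noisy, so ordering within rank 1 is indicative rather than definitive.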

Python3?

Just started using your code. I also just started using Python 3.5. Do you have any plans for 3.5 support? If not, I'll give it a shot.

boruta doesn't accept sparse matrices?

Hi, it looks like Boruta doesn't accept sparse matrices by default (I assume this is determined on line 517 of boruta_py.py, since the default value of accept_sparse is False for check_X_y). I get the following error:

A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.

Is there any particular reason for this, given RandomForestClassifier does accept sparse matrices, at least in the more recent releases of sklearn?
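Until accept_sparse is enabled in that check_X_y call, the practical workaround is the one the error message suggests: densify before fitting, memory permitting. A minimal sketch with scipy:

```python
import numpy as np
from scipy import sparse

X_sparse = sparse.csr_matrix(np.eye(4))   # stand-in for a sparse feature matrix

# Boruta's input validation rejects sparse input, so convert up front
X_dense = X_sparse.toarray()
# feat_selector.fit(X_dense, y) would then pass validation
```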

idea: early stopping based on % tentative

I have a feature idea: maybe it would be possible to stop early once the number of tentative features falls below a threshold (possibly a percentage of the full feature set, a specific number, or, if we want to get fancy, a function parameter that returns a boolean).

Why?
I noticed that in one instance, Boruta had less than 5% of my features marked as tentative after fewer than 10 rounds, but it may then take many more rounds to classify those 5%. In a lot of cases I would be fine just calling all of these confirmed.

I could work on a PR for this, but thought I would ask before I start working on it.
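The proposed rule could be as small as a predicate evaluated once per iteration; a hedged sketch of what such a hook might look like (the names are invented, not BorutaPy API):

```python
def should_stop_early(n_tentative, n_features, max_tentative_frac=0.05):
    """Hypothetical early-stopping rule: stop once the tentative share is small."""
    return n_tentative / n_features < max_tentative_frac

# e.g. 4 tentative features out of 500 -> stop and treat them as confirmed
stop = should_stop_early(4, 500)
```

Accepting a callable instead of a fraction would cover the "function parameter that returns a boolean" variant of the idea.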

Make boruta_py suitable for GridSearches

When using boruta_py in a sklearn grid search, the error "object has no attribute 'get_params'" occurs. It would be interesting if one could also optimize the parameters of the Boruta feature selection.

Transfer this project to scikit-learn-contrib

Hi,

this is not really an 'issue', only a suggestion. You could transfer this project to scikit-learn-contrib, which is a collection of algorithms compatible with scikit-learn but not yet in a state where they can be merged. This could greatly increase the visibility of your project. What do you think about this?

On the use of pruned trees

The README states:
"We highly recommend using pruned trees with a depth between 3-7."

For my data, a depth of 3 rejects the fewest variables, a depth of 7 rejects more, and no pruning at all results in all variables being rejected.

I'm curious to know why this is, as I had expected that greater depth would afford greater sensitivity to subtle interactions between features.

Issue when run your example

Hi,

When I run your example code, at the line 'feat_selector.fit(X,y)' I get the error 'TypeError: unhashable type: 'slice''. So I tried changing to y = y.values and X = X.values. Then, after 99 iterations (max_iter = 100), there is another error, 'TypeError: iteration over a 0-d array'.

So I was wondering what happen there... Thanks a lot

hits in sp.stats.binom

Hi,

Is there a reason why we use "hits - 1" in "to_accept_ps", but "hits" in "to_reject_ps", when computing the binomial distribution:

to_accept_ps = sp.stats.binom.sf(hits - 1, _iter, .5).flatten()
to_reject_ps = sp.stats.binom.cdf(hits, _iter, .5).flatten()

Shouldn't we use hits in both cases?

Many thanks
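The asymmetry appears intentional rather than a bug: sf and cdf test different tails, and each must include the observed count. sf(hits - 1) is P(X > hits - 1) = P(X >= hits), evidence for acceptance, while cdf(hits) is P(X <= hits), evidence for rejection. A stdlib-only check of those tail identities (illustrative values for n and hits):

```python
from math import comb

def pmf(k, n, p=0.5):
    """Binomial probability mass P(X == k)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, hits = 20, 14
# what sf(hits - 1, n, .5) computes: P(X > hits-1) = P(X >= hits)
p_accept = sum(pmf(i, n) for i in range(hits, n + 1))
# what cdf(hits, n, .5) computes: P(X <= hits)
p_reject = sum(pmf(i, n) for i in range(0, hits + 1))

# the two tails overlap only at X == hits, so they sum to 1 + P(X == hits)
total = p_accept + p_reject
```

Using hits in both calls would instead give P(X >= hits + 1) on the acceptance side, silently excluding the observed count from its own tail.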

Doesn't work well on the Madalon data set when I set `two_step=True`

Dear all

I ran Madalon_Data_Set.ipynb and got only 1 important feature.
According to the original paper, the Madelon data set contains 20 important features, but I could only get one. Why?

On the other hand, I could get approximately 20 features when I set two_step=False.

My results are following.

two_step is True

rf = RandomForestClassifier(n_jobs=int(cpu_count()/2), class_weight='balanced', max_depth=7)
# define Boruta feature selection method
feat_selector = BorutaPy(rf, n_estimators='auto', two_step=True, verbose=2, random_state=42)
BorutaPy finished running.

Iteration: 	35 / 100
Confirmed: 	1
Tentative: 	0
Rejected: 	498

two_step is False

rf = RandomForestClassifier(n_jobs=int(cpu_count()/2), class_weight='balanced', max_depth=7)
# define Boruta feature selection method
feat_selector = BorutaPy(rf, n_estimators='auto', two_step=False, verbose=2, random_state=42)
BorutaPy finished running.

Iteration: 	55 / 100
Confirmed: 	21
Tentative: 	0
Rejected: 	478

Running CLF on basic sklearn datasets

Python 2.7 (64-bit), sklearn 0.18.1, Boruta 0.1.5

Results of running CLF on basic sklearn datasets:
IRIS fails
BREAST_CANCER fails
DIGITS completes

The error is:
File "C:\Python27\lib\site-packages\scipy\stats\mstats_basic.py", line 254, in _rank1d
for r in repeats[0]:
TypeError: iteration over a 0-d array

RandomForestClassifier params: n_estimators=3, max_depth=3
BorutaPy params: perc=100, alpha=.01

For BREAST_CANCER, when changing the classifier to a max_depth of 1, the code runs.
For IRIS, when changing the classifier to a max_depth of 1, the code still fails; further, it seems that no parameters work for the IRIS dataset.

Is there a way to improve the stability? I saw this old issue/commit, but the error seems more extensive.
80a74c1

iteration over a 0-d array

Hi Team -
I have faced this "iteration over a 0-d array" error for a specific data set. I read all the Q&A and understood it is fixed (if I am right), but the problem seems to persist for a dataset (wine).
There are no NaN values in any rows/columns, nor a full array of NaN values, but I am still facing this issue.

It would be of great help if you could guide me on this, unless I have coded it wrongly. Thanks.
Here are the dataset and code:
wine.csv.zip
FEATURE_SELECTION_BORUTA.py.zip
