
boruta_py's Issues

Error when all variables are selected

When all variables are selected, the following error is thrown:


ValueError Traceback (most recent call last)
in ()
----> 1 feat_selector.fit(X,y)

/home/mike/dev/boruta_py/boruta/boruta_py2.py in fit(self, X, y)
170 """
171
--> 172 return self._fit(X, y)
173
174 def transform(self, X, weak=False):

/home/mike/dev/boruta_py/boruta/boruta_py2.py in _fit(self, X, y)
322 else:
323 # and 2 otherwise
--> 324 ranks = ranks - np.min(ranks) + 2
325 self.ranking[not_selected] = ranks
326

/home/mike/anaconda/envs/Python2.7/lib/python2.7/site-packages/numpy/core/fromnumeric.pyc in amin(a, axis, out, keepdims)
2357 else:
2358 return _methods._amin(a, axis=axis,
-> 2359 out=out, keepdims=keepdims)
2360
2361

/home/mike/anaconda/envs/Python2.7/lib/python2.7/site-packages/numpy/core/_methods.pyc in _amin(a, axis, out, keepdims)
27
28 def _amin(a, axis=None, out=None, keepdims=False):
---> 29 return umr_minimum(a, axis, None, out, keepdims)
30
31 def _sum(a, axis=None, dtype=None, out=None, keepdims=False):

ValueError: zero-size array to reduction operation minimum which has no identity

Debug Example:

selected = [0 1 2 3]
not_selected = []
imp_history_rejected = []
iter_ranks = []
rank_medians = []
ranks = []

(I'll send you a PR with fixes for these, just documenting them here so you know what the PR is doing.)

Question: Feature Selection for Regression Problems

The examples provided apply Boruta to feature selection in classification problems. Can Boruta be accurately applied to feature selection in regression problems? If so, which regression estimator would be most appropriate (e.g. RandomForestRegressor, GradientBoostingRegressor, etc.)?
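For what it's worth, BorutaPy appears to accept any scikit-learn-style estimator that exposes feature_importances_, so a RandomForestRegressor can be passed in directly. As a rough illustration of the mechanism with a regressor (not the BorutaPy API itself), here is a single shadow-feature round, using only scikit-learn:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X, y = make_regression(n_samples=300, n_features=8, n_informative=3,
                       random_state=0)

# Build "shadow" features: copies of each column with the rows shuffled,
# which destroys any relationship with y while keeping each marginal
# distribution intact.
X_shadow = X.copy()
for j in range(X_shadow.shape[1]):
    rng.shuffle(X_shadow[:, j])

rf = RandomForestRegressor(n_estimators=100, max_depth=5, random_state=0)
rf.fit(np.hstack([X, X_shadow]), y)

n = X.shape[1]
real_imp = rf.feature_importances_[:n]
shadow_max = rf.feature_importances_[n:].max()
hits = real_imp > shadow_max   # features that beat the best shadow this round
print(hits.sum(), "features beat the best shadow")
```

Boruta repeats rounds like this and applies a binomial test to the hit counts, so any importance-producing regressor should slot in the same way a classifier does.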

Tuple Index Out of range

Hi, my working environment is Python 3.6.4 with the current version of boruta. When I ran the example dataset in boruta's examples folder, the following error message appeared: "tuple index out of range".
The error seems to come from this line:
"if not_selected.shape[0] > 0 and not_selected.shape[1] > 0:"

Example Madalon_Data_Set.ipynb does not work

Dear all,
I'm getting acquainted with boruta and I have tried to execute the example Madalon_Data_Set.ipynb.
Everything works fine until

feat_selector.fit(X,y),

where I get the following error message:
`TypeError Traceback (most recent call last)
in ()
----> 1 feat_selector.fit(X,y)

/usr/local/lib/python3.5/dist-packages/boruta/boruta_py.py in fit(self, X, y)
199 """
200
--> 201 return self._fit(X, y)
202
203 def transform(self, X, weak=False):

/usr/local/lib/python3.5/dist-packages/boruta/boruta_py.py in _fit(self, X, y)
283
284 # add shadow attributes, shuffle them and train estimator, get imps
--> 285 cur_imp = self._add_shadows_get_imps(X, y, dec_reg)
286
287 # get the threshold of shadow importances we will use for rejection

/usr/local/lib/python3.5/dist-packages/boruta/boruta_py.py in _add_shadows_get_imps(self, X, y, dec_reg)
396 # find features that are tentative still
397 x_cur_ind = np.where(dec_reg >= 0)[0]
--> 398 x_cur = np.copy(X[:, x_cur_ind])
399 x_cur_w = x_cur.shape[1]
400 # deep copy the matrix for the shadow matrix

/usr/local/lib/python3.5/dist-packages/pandas/core/frame.py in __getitem__(self, key)
1962 return self._getitem_multilevel(key)
1963 else:
-> 1964 return self._getitem_column(key)
1965
1966 def _getitem_column(self, key):

/usr/local/lib/python3.5/dist-packages/pandas/core/frame.py in _getitem_column(self, key)
1969 # get column
1970 if self.columns.is_unique:
-> 1971 return self._get_item_cache(key)
1972
1973 # duplicate columns & possible reduce dimensionality

/usr/local/lib/python3.5/dist-packages/pandas/core/generic.py in _get_item_cache(self, item)
1641 """Return the cached item, item represents a label indexer."""
1642 cache = self._item_cache
-> 1643 res = cache.get(item)
1644 if res is None:
1645 values = self._data.get(item)

TypeError: unhashable type: 'slice'`

Is there a simple solution?

I'm working on a Linux machine with Python 3.5. My Python packages are up to date.

Many thanks in advance,
Flavio

Doesn't work well on the Madalon data set when I set `two_step=True`

Dear all

I ran Madalon_Data_Set.ipynb and got only 1 important feature.
According to the original paper, the Madelon data set contains 20 important features, but I got only one. Why?

On the other hand, I got approximately 20 features when I set two_step=False.

My results are following.

two step is True

rf = RandomForestClassifier(n_jobs=int(cpu_count()/2), class_weight='balanced', max_depth=7)
# define Boruta feature selection method
feat_selector = BorutaPy(rf, n_estimators='auto', two_step=True, verbose=2, random_state=42)
BorutaPy finished running.

Iteration: 	35 / 100
Confirmed: 	1
Tentative: 	0
Rejected: 	498

two step is false

rf = RandomForestClassifier(n_jobs=int(cpu_count()/2), class_weight='balanced', max_depth=7)
# define Boruta feature selection method
feat_selector = BorutaPy(rf, n_estimators='auto', two_step=False, verbose=2, random_state=42)
BorutaPy finished running.

Iteration: 	55 / 100
Confirmed: 	21
Tentative: 	0
Rejected: 	478

Confirmed + Tentative + Rejected not always equal to number of features in final iteration?

Hi,

it seems that throughout the _print_results method, Confirmed + Tentative + Rejected is correct except in the final cycle, where the "Tentative" value appears to be incorrect. Could this mean self.support_weak_ is not always updated correctly?

Here is an example of the output (see the last cycle, where the values don't sum to the number of features):

Iteration: 1 / 150
Confirmed: 0
Tentative: 47
Rejected: 0

....

Iteration: 148 / 150
Confirmed: 10
Tentative: 2
Rejected: 35

Iteration: 149 / 150
Confirmed: 10
Tentative: 2
Rejected: 35

BorutaPy finished running.

Iteration: 150 / 150
Confirmed: 10
Tentative: 1
Rejected: 35

Iterating over 0-d array

When running Boruta, if all features get selected, the code raises an error when _nanrankdata is called in BorutaPy.


Overwrites model's random state

It seems that Boruta passes RandomState(MT19937) to the model it is fitting regardless of the model's parameters. This doesn't bother a random forest model, but causes an XGBoost model to fail with the following error:

ValueError: Please check your X and y variable. The provided estimator cannot be fitted to your data. Invalid Parameter format for seed expect int but value='RandomState(MT19937)'

_nanrankdata

Thank you for your nice implementation of boruta!
However, I am wondering why you reimplemented the nanrankdata method from bottleneck. The reason I ask is that your new implementation cannot handle cases where the input data has the form

X = [..., [nan, nan, nan], ...]

whereas bottleneck.nanrankdata can (and this case occurs in my examples). I am sure there was a good reason not to use the bottleneck method...

Thank you for your help!

Better test coverage

We could add some unit tests. Maybe some border cases with obvious irrelevant features.

pip install Botura not working

I have been trying to pip3 install Boruta but it is not working, returning the following error:

Collecting Boruta
Could not find a version that satisfies the requirement Botura (from versions: )
No matching distribution found for Botura

Any idea how to resolve this issue?

Thanks

Accessing the individual Z-scores of variables

Is it possible to access the individual Z-scores of variables? Such as to make a visualization that has been done in Fig. 2 of the original paper.

BorutaPy not existent in boruta_py

Dear Dan - thanks for the code. I was having issues with the 0-d array reported elsewhere, which you had fixed by posting that the package should be cloned directly from GitHub, which I did:
!git clone https://github.com/scikit-learn-contrib/boruta_py.git
If I then attempt to import BorutaPy from .boruta_py or boruta_py (per the example here: http://danielhomola.com/2015/05/08/borutapy-an-all-relevant-feature-selection-method/), the module doesn't exist:
ImportError: cannot import name 'boruta_py' from 'boruta_py' (unknown location)
even though cloning boruta_py from git was successful:

(base) C:\Users\Amin>git clone https://github.com/scikit-learn-contrib/boruta_py.git
Cloning into 'boruta_py'...
remote: Enumerating objects: 17, done.
remote: Counting objects: 100% (17/17), done.
remote: Compressing objects: 100% (10/10), done.
remote: Total 284 (delta 8), reused 16 (delta 7), pack-reused 267
Receiving objects: 100% (284/284), 146.78 KiB | 853.00 KiB/s, done.
Resolving deltas: 100% (134/134), done.

Do you have advice on how to fix this? The pip version has the 0-d array issue, and this one cannot locate boruta_py.

iteration over a 0-d array

Hi Team,
I have hit this "iteration over a 0-d array" error on a specific data set. I have read all the Q&A and understood that it was fixed (if I am right), but the problem seems to persist for the wine dataset.
There are no NaN values in any rows/columns, nor a full array of NaNs, but I am still facing this issue.

It would be a great help if you could guide me on this, unless I have coded something wrongly. Thanks.
Here are the dataset and code:
wine.csv.zip
FEATURE_SELECTION_BORUTA.py.zip

XGBoost Support

Is BorutaPy compatible with XGBoost? If not, would you be interested in a PR for that compatibility (assuming it's possible and I can figure it out)?

It seems to me that this is not currently supported since I got an error when I tried it with XGBClassifier, but I wanted to know if there's any official word.

Thanks!

hits in sp.stats.binom

Hi,

Is there a reason why we use "hits-1" in "to_accept_ps", but "hits" in "to_reject_ps" when computing the binomial distribution:

to_accept_ps = sp.stats.binom.sf(hits - 1, _iter, .5).flatten()
to_reject_ps = sp.stats.binom.cdf(hits, _iter, .5).flatten()

Shouldn't we use hits in both cases?

Many thanks
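For what it's worth, the asymmetry looks deliberate: scipy's sf is strict (P(X > k)), so sf(hits - 1) gives the inclusive tail P(X >= hits), while cdf(hits) is already the inclusive P(X <= hits). A quick check:

```python
import numpy as np
from scipy.stats import binom

# sf(k) = P(X > k) for a discrete distribution, so P(X >= hits) = sf(hits - 1),
# which is why the accept p-value uses hits - 1, while the reject p-value
# wants P(X <= hits) = cdf(hits).
hits, n_iter = 12, 20
p_accept = binom.sf(hits - 1, n_iter, .5)   # P(X >= hits)
p_reject = binom.cdf(hits, n_iter, .5)      # P(X <= hits)

# Cross-check against explicit sums of the pmf:
assert np.isclose(p_accept, sum(binom.pmf(k, n_iter, .5)
                                for k in range(hits, n_iter + 1)))
assert np.isclose(p_reject, sum(binom.pmf(k, n_iter, .5)
                                for k in range(0, hits + 1)))
print(p_accept, p_reject)
```

Using hits in both calls would make the accept test compute P(X > hits) and silently drop the boundary case.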

can't reproduce the example

I can't reproduce your example with my dataset.
Here's the error I'm getting:

 File "/home/imahmoudi/python/lib/python3.6/site-packages/boruta/boruta_py.py", line 380, in _get_imp
    self.estimator.fit(X, y)
  File "/home/imahmoudi/python/lib/python3.6/site-packages/sklearn/ensemble/forest.py", line 272, in fit
    y, expanded_class_weight = self._validate_y_class_weight(y)
  File "/home/imahmoudi/python/lib/python3.6/site-packages/sklearn/ensemble/forest.py", line 493, in _validate_y_class_weight
    % self.class_weight)
ValueError: Valid presets for class_weight include "balanced" and "balanced_subsample". Given "auto".

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "FSelection.py", line 31, in <module>
    feat_selector.fit(X[:5000], y[:5000])
  File "/home/imahmoudi/python/lib/python3.6/site-packages/boruta/boruta_py.py", line 201, in fit
    return self._fit(X, y)
  File "/home/imahmoudi/python/lib/python3.6/site-packages/boruta/boruta_py.py", line 285, in _fit
    cur_imp = self._add_shadows_get_imps(X, y, dec_reg)
  File "/home/imahmoudi/python/lib/python3.6/site-packages/boruta/boruta_py.py", line 408, in _add_shadows_get_imps
    imp = self._get_imp(np.hstack((x_cur, x_sha)), y)
  File "/home/imahmoudi/python/lib/python3.6/site-packages/boruta/boruta_py.py", line 383, in _get_imp
    'estimator cannot be fitted to your data.\n' + e)
TypeError: must be str, not ValueError

Any idea how to solve this?
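The first traceback points at class_weight="auto", a preset that was removed from recent scikit-learn releases. A minimal sketch of the fix, assuming the estimator was built with the old 'auto' value:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# 'auto' was removed from scikit-learn's class_weight presets;
# 'balanced' is the current equivalent.
rf = RandomForestClassifier(class_weight='balanced', max_depth=5,
                            random_state=0)
rf.fit(X, y)   # fits without the "Valid presets" ValueError
print(rf.score(X, y))
```

The secondary "must be str, not ValueError" error comes from boruta's own error-reporting code concatenating a string with the exception object, which masks the underlying class_weight message.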

boruta doesn't accept sparse matrices?

Hi, it looks like Boruta doesn't accept sparse matrices by default (I assume this is determined at line 517 of boruta_py.py, since the default value of accept_sparse is False for check_X_y). I get the following error:

A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.

Is there any particular reason for this, given RandomForestClassifier does accept sparse matrices, at least in the more recent releases of sklearn?
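Until sparse input is supported, one workaround (feasible only when the densified matrix fits in memory) is to convert before fitting, as the error message itself suggests:

```python
import numpy as np
from scipy.sparse import csr_matrix

X_sparse = csr_matrix(np.eye(4))   # stand-in for a real sparse feature matrix
X_dense = X_sparse.toarray()       # densify before passing to BorutaPy.fit
assert isinstance(X_dense, np.ndarray)
print(X_dense.shape)
```

For genuinely large sparse matrices this defeats the purpose, so native accept_sparse support would still be the better fix.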

Make boruta_py suitable for GridSearches

When using boruta_py in a scikit-learn grid search, the error "object has no attribute 'get_params'" occurs. It would be useful if one could also optimize the parameters of the Boruta feature selection.

Bug when doing feature selection

I got following exception on the completion of feature selection by boruta:

Iteration: 34 / 100
Confirmed: 10
Tentative: 0
Rejected: 432

IndexError Traceback (most recent call last)
in ()

E:\Anaconda3\lib\site-packages\boruta\boruta_py.py in fit(self, X, y)
199 """
200
--> 201 return self._fit(X, y)
202
203 def transform(self, X, weak=False):

E:\Anaconda3\lib\site-packages\boruta\boruta_py.py in _fit(self, X, y)
334
335 # update rank for not_selected features
--> 336 if not_selected.shape[0] > 0 and not_selected.shape[1] > 0:
337 # calculate ranks in each iteration, then median of ranks across feats
338 iter_ranks = self._nanrankdata(imp_history_rejected, axis=1)

IndexError: tuple index out of range

The issue seems to be caused by 52d504b

and could be fixed by merging following commit from guitarmind@f68cfcd

Transfer this project to scikit-learn-contrib

Hi,

this is not really an 'issue', only a suggestion. You could transfer this project to the scikit-learn-contrib which is a collection of algorithms compatible with scikit-learn, but not yet in a state where they are merged. This could greatly increase the visibility of your project. What do you think about this?

Ranking the features

Hi,
I want to rank features according to their scores; I know a random forest can assign each feature a different score. feat_selector.ranking_ outputs ranked features, but many features share the same level, e.g. [ 1 1 1 1 1 1 1 28 7 12 16 17 2 31 32 25 30 4 3 24 28 133 386 415 426 117 493 407 185 453 202 310 199 73 74 302]. Could I ask how to further rank the features that share the same level? Thanks.

Question: High Collinearity, how does Boruta handle?

Hello,

Thanks for the package, I found it quite interesting.

When there are variables that are highly correlated, could that affect the Z-scores?

The only reason why I ask is in the past I have seen groups of highly correlated variables where the variables within that group have varied widely in their importance.

Would it make sense to handle the collinearity problem before running Boruta?

Sincerely,
G

New release

@danielhomola Hi! There have been quite a few bug fixes since 0.1.5. Would you mind making a new release and putting it on pypi.org?

On the use of pruned trees

The README states:
"We highly recommend using pruned trees with a depth between 3-7."

For my data, a depth of 3 rejects the fewest variables, increasing the depth to 7 rejects more, and no pruning at all results in all variables being rejected.

I'm curious to know why this is, as I had expected that greater depth would afford greater sensitivity to subtle interactions between features.

Plotting the result

Is it possible to create box plots for the feature ranks? Similar to the one in R?
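BorutaPy doesn't appear to expose a per-iteration importance history publicly, but the R-style box plot can be approximated by refitting the estimator over several seeds and collecting importances. A sketch (the matplotlib call is illustrative and left commented out so the snippet runs headless):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=6, n_informative=3,
                           random_state=0)

# Collect importances from repeated fits with different seeds.
imps = []
for seed in range(10):
    rf = RandomForestClassifier(n_estimators=100, random_state=seed)
    rf.fit(X, y)
    imps.append(rf.feature_importances_)
imps = np.array(imps)   # shape: (n_runs, n_features)

# One box per feature, similar to R's plot(Boruta_result):
# import matplotlib.pyplot as plt
# plt.boxplot(imps)
# plt.xlabel('feature'); plt.ylabel('importance'); plt.show()
print(imps.shape)
```

The R plot additionally shows the shadow-feature boxes; those could be added by appending shuffled copies of the columns before each fit, as Boruta does internally.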

"No module named Boruta"

I cloned the repository and pasted it into my anaconda/Lib/site-packages.

It is still not working; it shows "No module named boruta_py".

I installed this package correctly in anaconda/Lib/site-packages, but it still gives me the error "No module named boruta_py". I checked that boruta is installed there, but I get the same error.
I have restarted my Spyder IDE several times.

Running CLF on basic sklearn datasets

Python 2.7 (64-bit), sklearn 0.18.1, Boruta 0.1.5

Results of running CLF on basic sklearn datasets:
IRIS fails
BREAST_CANCER fails
DIGITS completes

The error is:
File "C:\Python27\lib\site-packages\scipy\stats\mstats_basic.py", line 254, in _rank1d
for r in repeats[0]:
TypeError: iteration over a 0-d array

RandomForestClassifier params: n_estimators=3, max_depth=3
BorutaPy params: perc=100, alpha=.01

For BREAST_CANCER, when changing the classifier to max_depth of 1, the code runs.
For IRIS, when changing the classifier to max_depth of 1, the code still fails. Further, it seems like no parameters work for the IRIS dataset.

Is there a way to improve the stability? I saw this old issue/commit but the error seems more extensive.
80a74c1

n_jobs = -1 does not work

Hello,

I am trying to use all my cores when running boruta_py.
I set, for instance, n_jobs=-1 inside the RandomForest class, but only one core does the job.
I get no errors/warnings.

Python3?

Just started using your code. I also just started using python 3.5. Do you have any plans for 3.5 support? If not I'll give it a shot.

idea: early stopping based on % tentative

I have a feature idea: maybe it would be possible to stop early if the number of tentative features reaches a threshold (possibly a percentage of the full feature set, or a specific number, or if we want to get fancy - a function parameter that returns a boolean.)

Why?
I noticed that in one instance, Boruta had less than 5% of my features marked as tentative after fewer than 10 rounds, but then it can take many more rounds to classify that remaining 5%. In many cases I would be fine just calling all of these confirmed.

I could work on a PR for this, but thought I would ask before I start working on it.

Categorical features

Hello,

More of a question than an issue, but how do you handle categorical features?

Also, it seems y needs to be an int? If I leave it as a float, I get an error.

Thanks!
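Boruta itself doesn't have special categorical handling; with tree-based estimators the usual approach is to encode categoricals yourself (one-hot or ordinal) and pass an integer-coded target. A small sketch with hypothetical column names:

```python
import pandas as pd

# Hypothetical frame: one categorical column and one numeric column.
df = pd.DataFrame({
    'color': ['red', 'blue', 'red', 'green'],
    'size':  [1.0, 2.5, 3.0, 0.5],
})
y = pd.Series(['yes', 'no', 'yes', 'no'])

# One-hot encode the categorical column, then hand numpy arrays to Boruta.
X = pd.get_dummies(df, columns=['color'], dtype=float).values
y_int = (y == 'yes').astype(int).values   # integer-coded target
print(X.shape, y_int)
```

One caveat: with one-hot encoding, Boruta judges each dummy column separately, so a category's levels can be confirmed or rejected individually rather than as a group.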

Issue when run your example

Hi,

When I run your example code, at the line 'feat_selector.fit(X,y)', I get 'TypeError: unhashable type: 'slice''. So I tried changing y = y.values and X = X.values. Then, after 99 iterations (max_iter = 100), another error appears: 'TypeError: iteration over a 0-d array'.

So I was wondering what happen there... Thanks a lot

Results vary between the R and Python implementations

It is mentioned that "the two_step parameter has to be set to False, then (with perc=100) BorutaPy behaves exactly as the R version."
In spite of doing this, results vary significantly. Is there a way to replicate the results exactly as in the R version of the package?
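For reference, the settings the documentation describes can be sketched as below. Even with them, the random number streams differ between R and numpy, so borderline features can still land on different sides of the threshold between the two implementations:

```python
from sklearn.ensemble import RandomForestClassifier
from boruta import BorutaPy

rf = RandomForestClassifier(n_jobs=-1, class_weight='balanced', max_depth=5)

# two_step=False and perc=100 reproduce the R algorithm's decision rule;
# the random draws themselves still differ between R and numpy, so
# feature-level results can diverge on borderline features.
feat_selector = BorutaPy(rf, n_estimators='auto', two_step=False, perc=100,
                         random_state=1)
```

Agreement is typically close on clearly relevant and clearly irrelevant features; the divergence concentrates in the tentative zone.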

When using Pandas DataFrames

_add_shadows_get_imps() fails when X is a pandas DataFrame rather than a numpy array.

A pandas DataFrame can no longer be sliced as
x_cur = np.copy(X[:, x_cur_ind])

It works as either
x_cur = np.copy(X.values[:, x_cur_ind])
OR
x_cur = np.copy(X.iloc[:, x_cur_ind])

I'd recommend testing for/casting DataFrames to numpy arrays in _fit.
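A minimal illustration of the failing slice and the cast-based fix (the frame and column index are arbitrary stand-ins):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [4.0, 5.0, 6.0]})

# Raw positional slicing like X[:, idx] raises on a DataFrame
# ("unhashable type: 'slice'"), but works after casting to an ndarray:
X = df.values                  # plain ndarray
x_cur = np.copy(X[:, [0]])    # the kind of slice _add_shadows_get_imps does
print(x_cur.shape)
```

Casting once at the top of _fit (e.g. via sklearn's check_X_y) would let users pass DataFrames without hitting this in the middle of a run.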

Setup CI for automated tests

A continuous integration service should be configured so that the test suite runs automatically on each PR.

Cannot install boruta_py

Hello,

I am using Homebrew Python on my Mac and am having some trouble installing boruta_py.
Could you help ?

Thanks !

iteration over a 0-d array in `_nanrankdata`

I gather that others have hit this (#12), but it still seems like a live issue, I'm afraid. It happens when X and y are ndarrays of what looks like the right shape.

There's a reproducible example on Iris data here.
