scikit-learn-contrib / boruta_py
Python implementations of the Boruta all-relevant feature selection method.
License: BSD 3-Clause "New" or "Revised" License
When all variables are selected, this error will be thrown.
ValueError Traceback (most recent call last)
in ()
----> 1 feat_selector.fit(X,y)
/home/mike/dev/boruta_py/boruta/boruta_py2.py in fit(self, X, y)
170 """
171
--> 172 return self._fit(X, y)
173
174 def transform(self, X, weak=False):
/home/mike/dev/boruta_py/boruta/boruta_py2.py in _fit(self, X, y)
322 else:
323 # and 2 otherwise
--> 324 ranks = ranks - np.min(ranks) + 2
325 self.ranking[not_selected] = ranks
326
/home/mike/anaconda/envs/Python2.7/lib/python2.7/site-packages/numpy/core/fromnumeric.pyc in amin(a, axis, out, keepdims)
2357 else:
2358 return _methods._amin(a, axis=axis,
-> 2359 out=out, keepdims=keepdims)
2360
2361
/home/mike/anaconda/envs/Python2.7/lib/python2.7/site-packages/numpy/core/_methods.pyc in _amin(a, axis, out, keepdims)
27
28 def _amin(a, axis=None, out=None, keepdims=False):
---> 29 return umr_minimum(a, axis, None, out, keepdims)
30
31 def _sum(a, axis=None, dtype=None, out=None, keepdims=False):
ValueError: zero-size array to reduction operation minimum which has no identity
Debug Example:
selected = [0 1 2 3]
not_selected = []
imp_history_rejected = []
iter_ranks = []
rank_medians = []
ranks = []
(I'll send you a PR with fixes for these, just documenting them here so you know what the PR is doing.)
The examples provided apply Boruta for feature selection in classification problems. Can Boruta be accurately applied for feature selection in regression problems? If so, what regression estimator would be most appropriate? (i.e. RandomForestRegressor, GradientBoostingRegressor, etc.)
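Boruta only needs an estimator that exposes feature_importances_ after fitting, so tree-ensemble regressors satisfy the contract. A minimal sketch on synthetic data (made up for illustration, not from the package's examples) showing why RandomForestRegressor is a natural fit; GradientBoostingRegressor qualifies for the same reason:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X = rng.normal(size=(300, 8))
# only columns 0 and 1 carry signal; the rest are noise
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=300)

rf = RandomForestRegressor(n_estimators=100, max_depth=5, random_state=0)
rf.fit(X, y)
print(rf.feature_importances_.round(2))  # importances concentrate on cols 0 and 1
```

With such an estimator in hand, BorutaPy should be usable exactly as in the classification examples, e.g. `BorutaPy(rf, n_estimators='auto', random_state=0).fit(X, y)`.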
Hi, my working environment is Python 3.6.4 and boruta is the current version. When I ran the example dataset in the example folder of boruta, the following error message showed: "tuple index out of range".
The error seems to happen at this line:
"if not_selected.shape[0] > 0 and not_selected.shape[1] > 0:"
Dear all,
I'm getting acquainted with boruta and I have tried to execute the example Madalon_Data_Set.ipynb.
Everything works fine until
feat_selector.fit(X,y),
where I get the following error message:
```
TypeError Traceback (most recent call last)
in ()
----> 1 feat_selector.fit(X,y)
/usr/local/lib/python3.5/dist-packages/boruta/boruta_py.py in fit(self, X, y)
199 """
200
--> 201 return self._fit(X, y)
202
203 def transform(self, X, weak=False):
/usr/local/lib/python3.5/dist-packages/boruta/boruta_py.py in _fit(self, X, y)
283
284 # add shadow attributes, shuffle them and train estimator, get imps
--> 285 cur_imp = self._add_shadows_get_imps(X, y, dec_reg)
286
287 # get the threshold of shadow importances we will use for rejection
/usr/local/lib/python3.5/dist-packages/boruta/boruta_py.py in _add_shadows_get_imps(self, X, y, dec_reg)
396 # find features that are tentative still
397 x_cur_ind = np.where(dec_reg >= 0)[0]
--> 398 x_cur = np.copy(X[:, x_cur_ind])
399 x_cur_w = x_cur.shape[1]
400 # deep copy the matrix for the shadow matrix
/usr/local/lib/python3.5/dist-packages/pandas/core/frame.py in getitem(self, key)
1962 return self._getitem_multilevel(key)
1963 else:
-> 1964 return self._getitem_column(key)
1965
1966 def _getitem_column(self, key):
/usr/local/lib/python3.5/dist-packages/pandas/core/frame.py in _getitem_column(self, key)
1969 # get column
1970 if self.columns.is_unique:
-> 1971 return self._get_item_cache(key)
1972
1973 # duplicate columns & possible reduce dimensionality
/usr/local/lib/python3.5/dist-packages/pandas/core/generic.py in _get_item_cache(self, item)
1641 """Return the cached item, item represents a label indexer."""
1642 cache = self._item_cache
-> 1643 res = cache.get(item)
1644 if res is None:
1645 values = self._data.get(item)
TypeError: unhashable type: 'slice'
```
Is there a simple solution?
I'm working on a linux machine with python 3.5. My python packages are up to date.
Many thanks in advance,
Flavio
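A common workaround for the "unhashable type: 'slice'" error above: BorutaPy at that version slices X with NumPy-style indexing like X[:, cols], which a pandas DataFrame does not support, so converting to plain arrays before calling fit avoids it. A minimal sketch with made-up data (df and target are hypothetical stand-ins for the Madelon frame):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the real data
df = pd.DataFrame(np.arange(12.0).reshape(4, 3), columns=['a', 'b', 'c'])
target = pd.Series([0, 1, 0, 1])

X = df.values              # plain ndarray (df.to_numpy() on newer pandas)
y = target.values.ravel()

# The NumPy-style slice that fails on a DataFrame works on the ndarray
cols = np.array([0, 2])
x_cur = np.copy(X[:, cols])
print(x_cur.shape)
```

With X and y as plain arrays, feat_selector.fit(X, y) should no longer hit the slice error.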
Dear all,
I ran Madalon_Data_Set.ipynb and got only 1 important feature.
According to the original paper, the Madelon data set contains 20 important features, but I only got one. Why?
On the other hand, I got approximately 20 features when I set two_step=False.
My results are as follows.
two step is True
rf = RandomForestClassifier(n_jobs=int(cpu_count()/2), class_weight='balanced', max_depth=7)
# define Boruta feature selection method
feat_selector = BorutaPy(rf, n_estimators='auto', two_step=True, verbose=2, random_state=42)
BorutaPy finished running.
Iteration: 35 / 100
Confirmed: 1
Tentative: 0
Rejected: 498
two step is false
rf = RandomForestClassifier(n_jobs=int(cpu_count()/2), class_weight='balanced', max_depth=7)
# define Boruta feature selection method
feat_selector = BorutaPy(rf, n_estimators='auto', two_step=False, verbose=2, random_state=42)
BorutaPy finished running.
Iteration: 55 / 100
Confirmed: 21
Tentative: 0
Rejected: 478
Hi,
it seems that throughout the _print_results method, Confirmed + Tentative + Rejected is correct except for the final cycle, in which the "Tentative" value appears incorrect. Could this mean self.support_weak_ is not always being updated correctly?
Here is an example of the output (see the last cycle, where the values don't sum to the number of features):
Iteration: 1 / 150
Confirmed: 0
Tentative: 47
Rejected: 0
....
Iteration: 148 / 150
Confirmed: 10
Tentative: 2
Rejected: 35
Iteration: 149 / 150
Confirmed: 10
Tentative: 2
Rejected: 35
BorutaPy finished running.
Iteration: 150 / 150
Confirmed: 10
Tentative: 1
Rejected: 35
It seems that boruta passes RandomState(MT19937) to the model it is fitting regardless of the model's parameters. This doesn't bother a random forest model, but causes an xgBoost model to fail with the following error:
ValueError: Please check your X and y variable. The provided estimator cannot be fitted to your data. Invalid Parameter format for seed expect int but value='RandomState(MT19937)'
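One way around this, assuming the estimator validates its seed as a plain int (as older XGBoost does for `seed`): draw an int from the RandomState yourself and set that on the estimator before handing it to BorutaPy. The XGBoost parameter name and its validation are version-dependent, so treat this as a sketch:

```python
import numpy as np

rs = np.random.RandomState(42)               # what boruta builds internally
seed = rs.randint(np.iinfo(np.int32).max)    # a plain int most estimators accept

print(type(seed))
# e.g. xgb.XGBClassifier(random_state=int(seed)) -- hypothetical usage;
# check your XGBoost version for the exact parameter name (seed vs random_state).
```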
Thank you for your nice implementation of boruta!
However, I am wondering why you reimplemented the nanrankdata method of bottleneck. The reason I am wondering is that your new implementation cannot handle cases where the input data contains rows of the form
X = [..., [nan, nan, nan], ...]
whereas bottleneck.nanrankdata can (and this case occurs in my examples). I am sure there was a good reason not to use the method from bottleneck...
Thank you for your help!
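For reference, a pure-NumPy/SciPy sketch (not boruta's actual implementation) of a nanrankdata that, like bottleneck's, leaves an all-NaN row all-NaN instead of raising:

```python
import numpy as np
from scipy.stats import rankdata

def safe_nanrankdata(X, axis=1):
    """Rank non-NaN entries along axis; all-NaN slices stay all-NaN."""
    X = np.asarray(X, dtype=float)
    out = np.full(X.shape, np.nan)
    n = X.shape[0] if axis == 1 else X.shape[1]
    for i in range(n):
        row = X[i] if axis == 1 else X[:, i]
        mask = ~np.isnan(row)
        if mask.any():                 # skip all-NaN slices instead of raising
            ranks = rankdata(row[mask])
            if axis == 1:
                out[i, mask] = ranks
            else:
                out[mask, i] = ranks
    return out

print(safe_nanrankdata([[2.0, 1.0, np.nan], [np.nan, np.nan, np.nan]]))
```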
We could add some unit tests. Maybe some border cases with obvious irrelevant features.
I have been trying to pip3 install Boruta but it is not working, returning the following error:
Collecting Boruta
Could not find a version that satisfies the requirement Botura (from versions: )
No matching distribution found for Botura
Any idea how to resolve this issue?
Thanks
Does it not support LightGBM/XGBoost?
Dear Dan, thanks for the code. I was having issues with the 0-size array reported elsewhere, which you had fixed by posting that the package should be cloned directly from GitHub, which I did:
!git clone https://github.com/scikit-learn-contrib/boruta_py.git
If I then attempt (per the example here: http://danielhomola.com/2015/05/08/borutapy-an-all-relevant-feature-selection-method/) to import BorutaPy from .boruta_py or boruta_py, the module doesn't exist:
ImportError: cannot import name 'boruta_py' from 'boruta_py' (unknown location)
whereas installing boruta_py via git clone was successful:
(base) C:\Users\Amin>git clone https://github.com/scikit-learn-contrib/boruta_py.git
Cloning into 'boruta_py'...
remote: Enumerating objects: 17, done.
remote: Counting objects: 100% (17/17), done.
remote: Compressing objects: 100% (10/10), done.
remote: Total 284 (delta 8), reused 16 (delta 7), pack-reused 267
Receiving objects: 100% (284/284), 146.78 KiB | 853.00 KiB/s, done.
Resolving deltas: 100% (134/134), done.
Do you have advice on how to fix this? The pip version has the 0-size array issue, and this one cannot locate boruta_py.
Does it work with ensembles besides random forest?
Hi Team -
I have faced this "iteration over a 0-d array" error for a specific data set. I read all the Q&A and understood it was fixed (if I am right), but the problem seems to persist for a dataset (wine).
There are no NaN values in any rows/columns, nor any all-NaN arrays, but I am still facing this issue.
It would be of great help if you could guide me on this, unless I have coded it wrongly. Thanks.
Here is the dataset and code
wine.csv.zip
FEATURE_SELECTION_BORUTA.py.zip
Is BorutaPy compatible with XGBoost? If not, would you be interested in a PR for that compatibility (assuming it's possible and I can figure it out)?
It seems to me that this is not currently supported since I got an error when I tried it with XGBClassifier, but I wanted to know if there's any official word.
Thanks!
Hi,
Is there a reason why we use "hits-1" in "to_accept_ps", but "hits" in "to_reject_ps" when computing the binomial distribution:
to_accept_ps = sp.stats.binom.sf(hits - 1, _iter, .5).flatten()
to_reject_ps = sp.stats.binom.cdf(hits, _iter, .5).flatten()
Shouldn't we use hits in both cases?
Many thanks
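For what it's worth, the asymmetry looks deliberate: binom.sf(k, n, p) is the exclusive upper tail P(X > k), so sf(hits - 1) gives the inclusive P(X >= hits), the p-value for "at least this many hits", while cdf(hits) is already the inclusive lower tail P(X <= hits). A quick numeric check (plain SciPy, not boruta code):

```python
import scipy.stats as sp_stats

hits, n_iter = 8, 10
p_accept = sp_stats.binom.sf(hits - 1, n_iter, .5)   # P(X >= 8)
p_reject = sp_stats.binom.cdf(hits, n_iter, .5)      # P(X <= 8)

# sf(hits - 1) is the inclusive upper tail: P(X >= 8) + P(X <= 7) == 1
assert abs(p_accept + sp_stats.binom.cdf(hits - 1, n_iter, .5) - 1) < 1e-12
print(p_accept, p_reject)
```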
I can't reproduce your example with my dataset.
Here's the error that I'm getting:
File "/home/imahmoudi/python/lib/python3.6/site-packages/boruta/boruta_py.py", line 380, in _get_imp
self.estimator.fit(X, y)
File "/home/imahmoudi/python/lib/python3.6/site-packages/sklearn/ensemble/forest.py", line 272, in fit
y, expanded_class_weight = self._validate_y_class_weight(y)
File "/home/imahmoudi/python/lib/python3.6/site-packages/sklearn/ensemble/forest.py", line 493, in _validate_y_class_weight
% self.class_weight)
ValueError: Valid presets for class_weight include "balanced" and "balanced_subsample". Given "auto".
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "FSelection.py", line 31, in <module>
feat_selector.fit(X[:5000], y[:5000])
File "/home/imahmoudi/python/lib/python3.6/site-packages/boruta/boruta_py.py", line 201, in fit
return self._fit(X, y)
File "/home/imahmoudi/python/lib/python3.6/site-packages/boruta/boruta_py.py", line 285, in _fit
cur_imp = self._add_shadows_get_imps(X, y, dec_reg)
File "/home/imahmoudi/python/lib/python3.6/site-packages/boruta/boruta_py.py", line 408, in _add_shadows_get_imps
imp = self._get_imp(np.hstack((x_cur, x_sha)), y)
File "/home/imahmoudi/python/lib/python3.6/site-packages/boruta/boruta_py.py", line 383, in _get_imp
'estimator cannot be fitted to your data.\n' + e)
TypeError: must be str, not ValueError
Any idea how to solve this?
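The root cause is in the first traceback: newer scikit-learn removed class_weight='auto'; the valid presets are 'balanced' and 'balanced_subsample'. A sketch with toy data (shapes and values made up) showing the forest fitting again once the preset is fixed:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

X = np.random.RandomState(0).normal(size=(60, 4))
y = np.array([0] * 50 + [1] * 10)            # imbalanced labels

rf = RandomForestClassifier(n_estimators=20, class_weight='balanced',
                            max_depth=3, random_state=0)
rf.fit(X, y)                                  # no ValueError with 'balanced'
print(rf.get_params()['class_weight'])
```

The same estimator can then be passed to BorutaPy in place of the one configured with 'auto'.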
The boruta_py release on PyPI is 0.3, but only 0.1.5 on GitHub. Is there any difference, or is 0.3 just equivalent to the latest state of the repo here?
How do I install this on a Windows computer?
Hi,
I am trying to play with Boruta, and sometimes I see that this algorithm rejects more features than it should. What parameters can I tune, either in the Boruta algorithm or in the random forest, in order to have it working properly?
Hi, it looks like Boruta doesn't accept sparse matrices by default (I assume this is determined at line 517 of boruta_py.py, since the default value of accept_sparse is False for check_X_y). I get the following error:
A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.
Is there any particular reason for this, given RandomForestClassifier does accept sparse matrices, at least in the more recent releases of sklearn?
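Until accept_sparse is supported, the workaround the error message suggests is to densify before fitting; fine for modest sizes, though for very wide sparse data the memory cost may bite. A small sketch:

```python
import numpy as np
from scipy.sparse import csr_matrix

X_sparse = csr_matrix(np.eye(4))     # stand-in for a real sparse design matrix
X_dense = X_sparse.toarray()         # what the error message suggests

print(X_dense.shape, isinstance(X_dense, np.ndarray))
```

X_dense can then be passed to feat_selector.fit in place of the sparse matrix.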
When using boruta_py in a sklearn grid search, the error object has no attribute 'get_params' occurs. It would be interesting if one could also optimize the parameters of the Boruta feature selection.
d:\Anaconda3\lib\site-packages\boruta\boruta_py.py:418: RuntimeWarning: invalid value encountered in greater
hits = np.where(cur_imp[0] > imp_sha_max)[0]
I got the following exception on completion of feature selection by boruta:
IndexError Traceback (most recent call last)
in ()
E:\Anaconda3\lib\site-packages\boruta\boruta_py.py in fit(self, X, y)
199 """
200
--> 201 return self._fit(X, y)
202
203 def transform(self, X, weak=False):
E:\Anaconda3\lib\site-packages\boruta\boruta_py.py in _fit(self, X, y)
334
335 # update rank for not_selected features
--> 336 if not_selected.shape[0] > 0 and not_selected.shape[1] > 0:
337 # calculate ranks in each iteration, then median of ranks across feats
338 iter_ranks = self._nanrankdata(imp_history_rejected, axis=1)
IndexError: tuple index out of range
The issue seems to be caused by 52d504b and could be fixed by merging the following commit from guitarmind@f68cfcd
Hi,
this is not really an 'issue', only a suggestion. You could transfer this project to the scikit-learn-contrib which is a collection of algorithms compatible with scikit-learn, but not yet in a state where they are merged. This could greatly increase the visibility of your project. What do you think about this?
Hi,
I want to get the ranking of features according to their scores; I know random forest can score each feature differently. feat_selector.ranking_ can output ranked features, but many features share the same rank, e.g. [ 1 1 1 1 1 1 1 28 7 12 16 17 2 31 32 25 30 4 3 24 28 133 386 415 426 117 493 407 185 453 202 310 199 73 74 302]. Could I ask how to further rank the features that share the same rank? Thanks.
Hello,
Thanks for the package, I found it quite interesting.
When there are variables that are highly correlated, could that affect the Z-scores?
The only reason I ask is that in the past I have seen groups of highly correlated variables where the variables within that group varied widely in their importance.
Would it make sense to handle the collinearity problem before running Boruta?
Sincerely,
G
If a tentative feature is rejected in the last iteration, the support_weak_ mask is adapted correctly; however, the ranking_ array isn't updated, so the rank of the rejected feature is still 2.
@danielhomola Hi! There were quite a few bug fixes since 0.1.5. Do you mind making a new release and putting it on pypi.org?
The README states:
"We highly recommend using pruned trees with a depth between 3-7."
For my data, a depth of 3 rejects the fewest variables, a depth of 7 rejects more, and no pruning at all results in all variables being rejected.
I'm curious to know why this is, as I had expected that greater depth would afford greater sensitivity to subtle interactions between features.
Is it possible to create box plots for the feature ranks? Similar to the one in R?
I cloned the repository and pasted it in my anaconda/Lib/site-packages, but it's still not working; it shows "No module named boruta_py".
I correctly installed this package in anaconda/Lib/site-packages, but it still gives me the error "No module named boruta_py". I checked that boruta is installed there, but I get the same error.
I restarted my Spyder IDE several times.
Python 2.7 (64-bit), sklearn 0.18.1, Boruta 0.1.5
Results of running CLF on basic sklearn datasets:
IRIS fails
BREAST_CANCER fails
DIGITS completes
The error is:
File "C:\Python27\lib\site-packages\scipy\stats\mstats_basic.py", line 254, in _rank1d
for r in repeats[0]:
TypeError: iteration over a 0-d array
RandomForestClassifier params: n_estimators=3, max_depth=3
BorutaPy params: perc=100, alpha=.01
For BREAST_CANCER, when changing the classifier to max_depth of 1, the code runs.
For IRIS, when changing the classifier to max_depth of 1, the code still fails. Further, it seems like no parameters work for the IRIS dataset.
Is there a way to improve the stability? I saw this old issue/commit but the error seems more extensive.
80a74c1
Hello,
I am trying to use all my cores when using boruta_py.
I set, for instance, n_jobs=-1 inside the RandomForest class, but only one core does the job.
I get no errors/warnings
Just started using your code. I also just started using Python 3.5. Do you have any plans for 3.5 support? If not, I'll give it a shot.
I have a feature idea: maybe it would be possible to stop early if the number of tentative features reaches a threshold (possibly a percentage of the full feature set, or a specific number, or if we want to get fancy - a function parameter that returns a boolean.)
Why?
I noticed that in one instance, Boruta has less than 5% of my features marked as tentative after less than 10 rounds, but then it may take many many rounds to classify these 5%. In a lot of cases I would be fine just calling all of these confirmed.
I could work on a PR for this, but thought I would ask before I start working on it.
Hello, do you consider uploading this package to PyPI and Anaconda?
Hello,
More of a question than an issue, but how do you handle categorical features?
Also, it seems like y needs to be an int? Otherwise, if I leave it as a float, I get an error.
Thanks!
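Boruta itself has no special categorical handling, so one common route (a sketch with made-up column and class names) is to one-hot encode categorical columns and integer-encode the target before fitting:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical data: one categorical column, one numeric column
df = pd.DataFrame({'color': ['red', 'blue', 'red'], 'size': [1.0, 2.0, 3.0]})
labels = ['cat', 'dog', 'cat']

X = pd.get_dummies(df, columns=['color'], dtype=float).values  # numeric matrix
y = LabelEncoder().fit_transform(labels)                       # integer labels

print(X.shape, y)
```

X and y in this form should also sidestep the float-target error mentioned above.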
I just ran the example offered in the readme, but I get an error: "tuple index out of range".
Hi,
When I run your example code, at the line feat_selector.fit(X,y), I get the error 'TypeError: unhashable type: 'slice''. So I tried to change y = y.values and X = X.values. Then, after 99 iterations (max_iter = 100), there is another error: 'TypeError: iteration over a 0-d array'.
So I was wondering what happened there... Thanks a lot.
It is mentioned that "the two_step parameter has to be set to False, then (with perc=100) BorutaPy behaves exactly as the R version."
Despite doing this, the results vary significantly. Is there a way to replicate the results exactly as in the R version of the package?
_add_shadows_get_imps() fails when X is pandas rather than numpy. A pandas DataFrame can no longer be sliced as
x_cur = np.copy(X[:, x_cur_ind])
It would need either
x_cur = np.copy(X.as_matrix()[:, x_cur_ind])
or
x_cur = np.copy(X.ix[:, x_cur_ind])
I'd recommend testing/casting dataframes to numpy arrays in _fit.
This project hosts Python implementations of the Boruta all-relevant feature selection method.
The URL in README is broken
A Continuous Integration service should be configured so that the test suite runs automatically on each PR.
Hello,
I am using Homebrew Python on my Mac and have some trouble installing boruta_py.
Could you help?
Thanks!