scikit-learn-contrib / boruta_py
Python implementations of the Boruta all-relevant feature selection method.
License: BSD 3-Clause "New" or "Revised" License
When all variables are selected, this error will be thrown.
ValueError Traceback (most recent call last)
in ()
----> 1 feat_selector.fit(X,y)
/home/mike/dev/boruta_py/boruta/boruta_py2.py in fit(self, X, y)
170 """
171
--> 172 return self._fit(X, y)
173
174 def transform(self, X, weak=False):
/home/mike/dev/boruta_py/boruta/boruta_py2.py in _fit(self, X, y)
322 else:
323 # and 2 otherwise
--> 324 ranks = ranks - np.min(ranks) + 2
325 self.ranking[not_selected] = ranks
326
/home/mike/anaconda/envs/Python2.7/lib/python2.7/site-packages/numpy/core/fromnumeric.pyc in amin(a, axis, out, keepdims)
2357 else:
2358 return _methods._amin(a, axis=axis,
-> 2359 out=out, keepdims=keepdims)
2360
2361
/home/mike/anaconda/envs/Python2.7/lib/python2.7/site-packages/numpy/core/_methods.pyc in _amin(a, axis, out, keepdims)
27
28 def _amin(a, axis=None, out=None, keepdims=False):
---> 29 return umr_minimum(a, axis, None, out, keepdims)
30
31 def _sum(a, axis=None, dtype=None, out=None, keepdims=False):
ValueError: zero-size array to reduction operation minimum which has no identity
Debug Example:
selected = [0 1 2 3]
not_selected = []
imp_history_rejected = []
iter_ranks = []
rank_medians = []
ranks = []
(I'll send you a PR with fixes for these, just documenting them here so you know what the PR is doing.)
The examples provided apply Boruta for feature selection in classification problems. Can Boruta be accurately applied for feature selection in regression problems? If so, what regression estimator would be most appropriate? (i.e. RandomForestRegressor, GradientBoostingRegressor, etc.)
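Boruta only needs an estimator that exposes feature_importances_ after fitting, so tree-ensemble regressors satisfy the contract. A minimal sketch on synthetic data (made up for illustration, not from the package's examples) showing why RandomForestRegressor is a natural fit; GradientBoostingRegressor qualifies for the same reason:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X = rng.normal(size=(300, 8))
# only columns 0 and 1 carry signal; the rest are noise
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=300)

rf = RandomForestRegressor(n_estimators=100, max_depth=5, random_state=0)
rf.fit(X, y)
print(rf.feature_importances_.round(2))  # importances concentrate on cols 0 and 1
```

With such an estimator in hand, BorutaPy should be usable exactly as in the classification examples, e.g. `BorutaPy(rf, n_estimators='auto', random_state=0).fit(X, y)`.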
Hi, my working environment is Python 3.6.4 and boruta is the current version. When I ran the example dataset in the example folder of boruta, the following error message showed: "tuple index out of range".
The error seems to happen at this line:
"if not_selected.shape[0] > 0 and not_selected.shape[1] > 0:"
Dear all,
I'm getting acquainted with boruta and I have tried to execute the example Madalon_Data_Set.ipynb.
Everything works fine until
feat_selector.fit(X,y),
where I get the following error message:
```
TypeError Traceback (most recent call last)
in ()
----> 1 feat_selector.fit(X,y)
/usr/local/lib/python3.5/dist-packages/boruta/boruta_py.py in fit(self, X, y)
199 """
200
--> 201 return self._fit(X, y)
202
203 def transform(self, X, weak=False):
/usr/local/lib/python3.5/dist-packages/boruta/boruta_py.py in _fit(self, X, y)
283
284 # add shadow attributes, shuffle them and train estimator, get imps
--> 285 cur_imp = self._add_shadows_get_imps(X, y, dec_reg)
286
287 # get the threshold of shadow importances we will use for rejection
/usr/local/lib/python3.5/dist-packages/boruta/boruta_py.py in _add_shadows_get_imps(self, X, y, dec_reg)
396 # find features that are tentative still
397 x_cur_ind = np.where(dec_reg >= 0)[0]
--> 398 x_cur = np.copy(X[:, x_cur_ind])
399 x_cur_w = x_cur.shape[1]
400 # deep copy the matrix for the shadow matrix
/usr/local/lib/python3.5/dist-packages/pandas/core/frame.py in getitem(self, key)
1962 return self._getitem_multilevel(key)
1963 else:
-> 1964 return self._getitem_column(key)
1965
1966 def _getitem_column(self, key):
/usr/local/lib/python3.5/dist-packages/pandas/core/frame.py in _getitem_column(self, key)
1969 # get column
1970 if self.columns.is_unique:
-> 1971 return self._get_item_cache(key)
1972
1973 # duplicate columns & possible reduce dimensionality
/usr/local/lib/python3.5/dist-packages/pandas/core/generic.py in _get_item_cache(self, item)
1641 """Return the cached item, item represents a label indexer."""
1642 cache = self._item_cache
-> 1643 res = cache.get(item)
1644 if res is None:
1645 values = self._data.get(item)
TypeError: unhashable type: 'slice'
```
Is there a simple solution?
I'm working on a linux machine with python 3.5. My python packages are up to date.
Many thanks in advance,
Flavio
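A common workaround for the "unhashable type: 'slice'" error above: BorutaPy at that version slices X with NumPy-style indexing like X[:, cols], which a pandas DataFrame does not support, so converting to plain arrays before calling fit avoids it. A minimal sketch with made-up data (df and target are hypothetical stand-ins for the Madelon frame):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the real data
df = pd.DataFrame(np.arange(12.0).reshape(4, 3), columns=['a', 'b', 'c'])
target = pd.Series([0, 1, 0, 1])

X = df.values              # plain ndarray (df.to_numpy() on newer pandas)
y = target.values.ravel()

# The NumPy-style slice that fails on a DataFrame works on the ndarray
cols = np.array([0, 2])
x_cur = np.copy(X[:, cols])
print(x_cur.shape)
```

With X and y as plain arrays, feat_selector.fit(X, y) should no longer hit the slice error.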
Dear all,
I ran Madalon_Data_Set.ipynb and got only 1 important feature.
According to the original paper, the Madelon data set contains 20 important features, but I only got one. Why?
On the other hand, I got approximately 20 features when I set two_step=False.
My results are as follows.
two step is True
rf = RandomForestClassifier(n_jobs=int(cpu_count()/2), class_weight='balanced', max_depth=7)
# define Boruta feature selection method
feat_selector = BorutaPy(rf, n_estimators='auto', two_step=True, verbose=2, random_state=42)
BorutaPy finished running.
Iteration: 35 / 100
Confirmed: 1
Tentative: 0
Rejected: 498
two step is false
rf = RandomForestClassifier(n_jobs=int(cpu_count()/2), class_weight='balanced', max_depth=7)
# define Boruta feature selection method
feat_selector = BorutaPy(rf, n_estimators='auto', two_step=False, verbose=2, random_state=42)
BorutaPy finished running.
Iteration: 55 / 100
Confirmed: 21
Tentative: 0
Rejected: 478
Hi,
it seems that throughout the _print_results method, Confirmed + Tentative + Rejected is correct except for the final cycle, in which the "Tentative" value appears incorrect. Could this mean self.support_weak_ is not always being updated correctly?
Here is an example of the output (see the last cycle, where the values don't sum to the number of features):
Iteration: 1 / 150
Confirmed: 0
Tentative: 47
Rejected: 0
....
Iteration: 148 / 150
Confirmed: 10
Tentative: 2
Rejected: 35
Iteration: 149 / 150
Confirmed: 10
Tentative: 2
Rejected: 35
BorutaPy finished running.
Iteration: 150 / 150
Confirmed: 10
Tentative: 1
Rejected: 35
It seems that boruta passes RandomState(MT19937) to the model it is fitting regardless of the model's parameters. This doesn't bother a random forest model, but causes an xgBoost model to fail with the following error:
ValueError: Please check your X and y variable. The provided estimator cannot be fitted to your data. Invalid Parameter format for seed expect int but value='RandomState(MT19937)'
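One way around this, assuming the estimator validates its seed as a plain int (as older XGBoost does for `seed`): draw an int from the RandomState yourself and set that on the estimator before handing it to BorutaPy. The XGBoost parameter name and its validation are version-dependent, so treat this as a sketch:

```python
import numpy as np

rs = np.random.RandomState(42)               # what boruta builds internally
seed = rs.randint(np.iinfo(np.int32).max)    # a plain int most estimators accept

print(type(seed))
# e.g. xgb.XGBClassifier(random_state=int(seed)) -- hypothetical usage;
# check your XGBoost version for the exact parameter name (seed vs random_state).
```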
Thank you for your nice implementation of boruta!
However, I am wondering why you reimplemented the nanrankdata method of bottleneck. The reason I am wondering is that your new implementation cannot handle cases where the input data contains rows of the form
X = [..., [nan, nan, nan], ...]
whereas bottleneck.nanrankdata can (and this case occurs in my examples). I am sure there was a good reason not to use the method from bottleneck...
Thank you for your help!
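For reference, a pure-NumPy/SciPy sketch (not boruta's actual implementation) of a nanrankdata that, like bottleneck's, leaves an all-NaN row all-NaN instead of raising:

```python
import numpy as np
from scipy.stats import rankdata

def safe_nanrankdata(X, axis=1):
    """Rank non-NaN entries along axis; all-NaN slices stay all-NaN."""
    X = np.asarray(X, dtype=float)
    out = np.full(X.shape, np.nan)
    n = X.shape[0] if axis == 1 else X.shape[1]
    for i in range(n):
        row = X[i] if axis == 1 else X[:, i]
        mask = ~np.isnan(row)
        if mask.any():                 # skip all-NaN slices instead of raising
            ranks = rankdata(row[mask])
            if axis == 1:
                out[i, mask] = ranks
            else:
                out[mask, i] = ranks
    return out

print(safe_nanrankdata([[2.0, 1.0, np.nan], [np.nan, np.nan, np.nan]]))
```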
We could add some unit tests. Maybe some border cases with obvious irrelevant features.
I have been trying to pip3 install Boruta but it is not working, returning the following error:
Collecting Boruta
Could not find a version that satisfies the requirement Botura (from versions: )
No matching distribution found for Botura
Any idea how to resolve this issue?
Thanks
Does it not support LightGBM/XGBoost?
Dear Dan, thanks for the code. I was having issues with the 0-size array reported elsewhere, which you had fixed by posting that the package should be cloned directly from GitHub, which I did:
!git clone https://github.com/scikit-learn-contrib/boruta_py.git
If I then attempt (per the example here: http://danielhomola.com/2015/05/08/borutapy-an-all-relevant-feature-selection-method/) to import BorutaPy from .boruta_py or boruta_py, the module doesn't exist:
ImportError: cannot import name 'boruta_py' from 'boruta_py' (unknown location)
whereas installing boruta_py via git clone was successful:
(base) C:\Users\Amin>git clone https://github.com/scikit-learn-contrib/boruta_py.git
Cloning into 'boruta_py'...
remote: Enumerating objects: 17, done.
remote: Counting objects: 100% (17/17), done.
remote: Compressing objects: 100% (10/10), done.
remote: Total 284 (delta 8), reused 16 (delta 7), pack-reused 267
Receiving objects: 100% (284/284), 146.78 KiB | 853.00 KiB/s, done.
Resolving deltas: 100% (134/134), done.
Do you have advice on how to fix this? The pip version has the 0-size array issue, and this one cannot locate boruta_py.
Does it work with ensembles besides random forest?
Hi Team -
I have faced this "iteration over a 0-d array" error for a specific data set. I read all the Q&A and understood it was fixed (if I am right), but the problem seems to persist for a dataset (wine).
There are no NaN values in any rows/columns, nor any all-NaN arrays, but I am still facing this issue.
It would be of great help if you could guide me on this, unless I have coded it wrongly. Thanks.
Here is the dataset and code
wine.csv.zip
FEATURE_SELECTION_BORUTA.py.zip
Is BorutaPy compatible with XGBoost? If not, would you be interested in a PR for that compatibility (assuming it's possible and I can figure it out)?
It seems to me that this is not currently supported since I got an error when I tried it with XGBClassifier, but I wanted to know if there's any official word.
Thanks!
Hi,
Is there a reason why we use "hits-1" in "to_accept_ps", but "hits" in "to_reject_ps" when computing the binomial distribution:
to_accept_ps = sp.stats.binom.sf(hits - 1, _iter, .5).flatten()
to_reject_ps = sp.stats.binom.cdf(hits, _iter, .5).flatten()
Shouldn't we use hits in both cases?
Many thanks
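For what it's worth, the asymmetry looks deliberate: binom.sf(k, n, p) is the exclusive upper tail P(X > k), so sf(hits - 1) gives the inclusive P(X >= hits), the p-value for "at least this many hits", while cdf(hits) is already the inclusive lower tail P(X <= hits). A quick numeric check (plain SciPy, not boruta code):

```python
import scipy.stats as sp_stats

hits, n_iter = 8, 10
p_accept = sp_stats.binom.sf(hits - 1, n_iter, .5)   # P(X >= 8)
p_reject = sp_stats.binom.cdf(hits, n_iter, .5)      # P(X <= 8)

# sf(hits - 1) is the inclusive upper tail: P(X >= 8) + P(X <= 7) == 1
assert abs(p_accept + sp_stats.binom.cdf(hits - 1, n_iter, .5) - 1) < 1e-12
print(p_accept, p_reject)
```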
I can't reproduce your example with my dataset.
Here's the error that I'm getting:
File "/home/imahmoudi/python/lib/python3.6/site-packages/boruta/boruta_py.py", line 380, in _get_imp
self.estimator.fit(X, y)
File "/home/imahmoudi/python/lib/python3.6/site-packages/sklearn/ensemble/forest.py", line 272, in fit
y, expanded_class_weight = self._validate_y_class_weight(y)
File "/home/imahmoudi/python/lib/python3.6/site-packages/sklearn/ensemble/forest.py", line 493, in _validate_y_class_weight
% self.class_weight)
ValueError: Valid presets for class_weight include "balanced" and "balanced_subsample". Given "auto".
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "FSelection.py", line 31, in <module>
feat_selector.fit(X[:5000], y[:5000])
File "/home/imahmoudi/python/lib/python3.6/site-packages/boruta/boruta_py.py", line 201, in fit
return self._fit(X, y)
File "/home/imahmoudi/python/lib/python3.6/site-packages/boruta/boruta_py.py", line 285, in _fit
cur_imp = self._add_shadows_get_imps(X, y, dec_reg)
File "/home/imahmoudi/python/lib/python3.6/site-packages/boruta/boruta_py.py", line 408, in _add_shadows_get_imps
imp = self._get_imp(np.hstack((x_cur, x_sha)), y)
File "/home/imahmoudi/python/lib/python3.6/site-packages/boruta/boruta_py.py", line 383, in _get_imp
'estimator cannot be fitted to your data.\n' + e)
TypeError: must be str, not ValueError
Any idea how to solve this?
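The root cause is in the first traceback: newer scikit-learn removed class_weight='auto'; the valid presets are 'balanced' and 'balanced_subsample'. A sketch with toy data (shapes and values made up) showing the forest fitting again once the preset is fixed:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

X = np.random.RandomState(0).normal(size=(60, 4))
y = np.array([0] * 50 + [1] * 10)            # imbalanced labels

rf = RandomForestClassifier(n_estimators=20, class_weight='balanced',
                            max_depth=3, random_state=0)
rf.fit(X, y)                                  # no ValueError with 'balanced'
print(rf.get_params()['class_weight'])
```

The same estimator can then be passed to BorutaPy in place of the one configured with 'auto'.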
The boruta_py release on PyPI is 0.3, but only 0.1.5 on GitHub. Is there any difference, or is 0.3 just equivalent to the latest state of the repo here?
How do I install this on a Windows computer?
Hi,
I am trying to play with Boruta, and sometimes I see that this algorithm rejects more features than it should. What parameters can I tune, either in the Boruta algorithm or in the random forest, in order to have it working properly?
Hi, it looks like Boruta doesn't accept sparse matrices by default (I assume this is determined at line 517 of boruta_py.py, since the default value of accept_sparse is False for check_X_y). I get the following error:
A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.
Is there any particular reason for this, given RandomForestClassifier does accept sparse matrices, at least in the more recent releases of sklearn?
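Until accept_sparse is supported, the workaround the error message suggests is to densify before fitting; fine for modest sizes, though for very wide sparse data the memory cost may bite. A small sketch:

```python
import numpy as np
from scipy.sparse import csr_matrix

X_sparse = csr_matrix(np.eye(4))     # stand-in for a real sparse design matrix
X_dense = X_sparse.toarray()         # what the error message suggests

print(X_dense.shape, isinstance(X_dense, np.ndarray))
```

X_dense can then be passed to feat_selector.fit in place of the sparse matrix.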
When using boruta_py in a sklearn grid search, the error object has no attribute 'get_params' occurs. It would be interesting if one could also optimize the parameters of the Boruta feature selection.
d:\Anaconda3\lib\site-packages\boruta\boruta_py.py:418: RuntimeWarning: invalid value encountered in greater
hits = np.where(cur_imp[0] > imp_sha_max)[0]
I got the following exception on completion of feature selection by boruta:
IndexError Traceback (most recent call last)
in ()
E:\Anaconda3\lib\site-packages\boruta\boruta_py.py in fit(self, X, y)
199 """
200
--> 201 return self._fit(X, y)
202
203 def transform(self, X, weak=False):
E:\Anaconda3\lib\site-packages\boruta\boruta_py.py in _fit(self, X, y)
334
335 # update rank for not_selected features
--> 336 if not_selected.shape[0] > 0 and not_selected.shape[1] > 0:
337 # calculate ranks in each iteration, then median of ranks across feats
338 iter_ranks = self._nanrankdata(imp_history_rejected, axis=1)
IndexError: tuple index out of range
The issue seems to be caused by 52d504b and could be fixed by merging the following commit from guitarmind@f68cfcd
Hi,
this is not really an 'issue', only a suggestion. You could transfer this project to the scikit-learn-contrib which is a collection of algorithms compatible with scikit-learn, but not yet in a state where they are merged. This could greatly increase the visibility of your project. What do you think about this?
Hi,
I want to get the ranking of features according to their scores; I know random forest can score each feature differently. feat_selector.ranking_ can output ranked features, but many features share the same rank, e.g. [ 1 1 1 1 1 1 1 28 7 12 16 17 2 31 32 25 30 4 3 24 28 133 386 415 426 117 493 407 185 453 202 310 199 73 74 302]. Could I ask how to further rank the features that share the same rank? Thanks.
Hello,
Thanks for the package, I found it quite interesting.
When there are variables that are highly correlated, could that affect the Z-scores?
The only reason I ask is that in the past I have seen groups of highly correlated variables where the variables within that group varied widely in their importance.
Would it make sense to handle the collinearity problem before running Boruta?
Sincerely,
G
If a tentative feature is rejected in the last iteration, the support_weak_ mask is adapted correctly; however, the ranking_ array isn't updated, so the rank of the rejected feature is still 2.
@danielhomola Hi! There were quite a few bug fixes since 0.1.5. Do you mind making a new release and putting it on pypi.org?
The README states:
"We highly recommend using pruned trees with a depth between 3-7."
For my data, a depth of 3 rejects the fewest variables, a depth of 7 rejects more, and no pruning at all results in all variables being rejected.
I'm curious to know why this is, as I had expected that greater depth would afford greater sensitivity to subtle interactions between features.
Is it possible to create box plots for the feature ranks? Similar to the one in R?
I cloned the repository and pasted it in my anaconda/Lib/site-packages, but it's still not working; it shows "No module named boruta_py".
I correctly installed this package in anaconda/Lib/site-packages, but it still gives me the error "No module named boruta_py". I checked that boruta is installed there, but I get the same error.
I restarted my Spyder IDE several times.
Python 2.7 (64-bit), sklearn 0.18.1, Boruta 0.1.5
Results of running CLF on basic sklearn datasets:
IRIS fails
BREAST_CANCER fails
DIGITS completes
The error is:
File "C:\Python27\lib\site-packages\scipy\stats\mstats_basic.py", line 254, in _rank1d
for r in repeats[0]:
TypeError: iteration over a 0-d array
RandomForestClassifier params: n_estimators=3, max_depth=3
BorutaPy params: perc=100, alpha=.01
For BREAST_CANCER, when changing the classifier to max_depth of 1, the code runs.
For IRIS, when changing the classifier to max_depth of 1, the code still fails. Further, it seems like no parameters work for the IRIS dataset.
Is there a way to improve the stability? I saw this old issue/commit but the error seems more extensive.
80a74c1
Hello,
I am trying to use all my cores when using boruta_py.
I set, for instance, n_jobs=-1 inside the RandomForest class, but only one core does the job.
I get no errors/warnings
Just started using your code. I also just started using Python 3.5. Do you have any plans for 3.5 support? If not, I'll give it a shot.
I have a feature idea: maybe it would be possible to stop early if the number of tentative features reaches a threshold (possibly a percentage of the full feature set, or a specific number, or if we want to get fancy - a function parameter that returns a boolean.)
Why?
I noticed that in one instance, Boruta has less than 5% of my features marked as tentative after less than 10 rounds, but then it may take many many rounds to classify these 5%. In a lot of cases I would be fine just calling all of these confirmed.
I could work on a PR for this, but thought I would ask before I start working on it.
Hello, do you consider uploading this package to PyPI and Anaconda?
Hello,
More of a question than an issue, but how do you handle categorical features?
Also, it seems like y needs to be an int? Otherwise, if I leave it as a float, I get an error.
Thanks!
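Boruta itself has no special categorical handling, so one common route (a sketch with made-up column and class names) is to one-hot encode categorical columns and integer-encode the target before fitting:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical data: one categorical column, one numeric column
df = pd.DataFrame({'color': ['red', 'blue', 'red'], 'size': [1.0, 2.0, 3.0]})
labels = ['cat', 'dog', 'cat']

X = pd.get_dummies(df, columns=['color'], dtype=float).values  # numeric matrix
y = LabelEncoder().fit_transform(labels)                       # integer labels

print(X.shape, y)
```

X and y in this form should also sidestep the float-target error mentioned above.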
I just ran the example offered in the readme, but I get an error: "tuple index out of range".
Hi,
When I run your example code, at the line feat_selector.fit(X,y), I get the error 'TypeError: unhashable type: 'slice''. So I tried to change y = y.values and X = X.values. Then, after 99 iterations (max_iter = 100), there is another error: 'TypeError: iteration over a 0-d array'.
So I was wondering what happened there... Thanks a lot.
It is mentioned that "the two_step parameter has to be set to False, then (with perc=100) BorutaPy behaves exactly as the R version."
Despite doing this, the results vary significantly. Is there a way to replicate the results exactly as in the R version of the package?
_add_shadows_get_imps() fails when X is pandas rather than numpy. A pandas DataFrame can no longer be sliced as
x_cur = np.copy(X[:, x_cur_ind])
It would need either
x_cur = np.copy(X.as_matrix()[:, x_cur_ind])
or
x_cur = np.copy(X.ix[:, x_cur_ind])
I'd recommend testing/casting dataframes to numpy arrays in _fit.
This project hosts Python implementations of the Boruta all-relevant feature selection method.
The URL in README is broken
A Continuous Integration service should be configured so that the test suite runs automatically on each PR.
Hello,
I am using Homebrew Python on my Mac and have some trouble installing boruta_py.
Could you help?
Thanks!