Comments (8)
I would close this issue and refer to feature_engine for feature elimination based on correlation. I believe it makes more sense to suggest a change to feature_engine's existing functionality than to build a new one for Probatus.
from probatus.
It could be nice to base the removal on the expected impact. For example, two features correlated with each other at 0.90 could have different correlations with the target. In that case, keep the one with the highest correlation with the target.
from probatus.
@gverbock indeed that would be a nice additional strategy.
It requires passing y to the object, so I would start with the simpler strategies first. However, if we decide to develop it, I will make an issue for that!
from probatus.
@Matgrb not sure what the point of removing correlated features iteratively is. Why not save some computational power and remove all features above threshold H in one go after you have the correlation matrix?
@gverbock @Matgrb Regarding some simple extra intelligence: you can create a feature rank that counts, for each feature, the number of pairs in which its correlation is above H, and then use this rank as an additional elimination rule. Assume you have X1, X2, X3 and X4. X1 is correlated with X4 and X1 is correlated with X2; the other features are not correlated with each other. In this case it makes sense to remove X1 and not consider removing X4 or X2.
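The ranking in the X1..X4 example can be sketched as follows. This is only an illustration: `correlation_rank` is a hypothetical helper, and the list of pairs is assumed to have been extracted from the correlation matrix already.

```python
from collections import Counter


def correlation_rank(pairs_above_h):
    """Count, for each feature, how many above-threshold pairs it appears in."""
    rank = Counter()
    for a, b in pairs_above_h:
        rank[a] += 1
        rank[b] += 1
    return rank


# Pairs whose correlation exceeds H in the example: X1-X4 and X1-X2.
pairs_above_h = [("X1", "X4"), ("X1", "X2")]

rank = correlation_rank(pairs_above_h)
to_remove = max(rank, key=rank.get)  # feature involved in the most pairs
print(rank, to_remove)
```

X1 appears in two pairs while X2 and X4 appear in one each, so X1 gets the highest rank and is the one to remove; X3 never appears and has rank 0.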
Regarding categorical features, there is no easy solution:
- If you want to measure the correlation between two categorical features you can use a chi-squared test, but there are a lot of options, especially if the features are ordinal
- If you want to measure the correlation between a categorical and a continuous feature you can use ANOVA or trend tests
I would propose to start with a simple implementation with Pearson correlation as described in the initial issue description. I can also add the ranking I mentioned above. If you agree, I can pick it up and make a PR.
from probatus.
The main point for doing this iteratively is the following situation
A, B, C, D features
Correlations A-B 0.95, A-C 0.9, A-D 0.8, B-C 0.95, B-D 0.8, C-D 0.8
The features correlated at or above 0.95 are A and B. If we remove iteratively, we remove only one of them; after that, we don't have to remove the other one, because it is no longer correlated with the remaining features and it carries information that they are missing.
If we simply remove all correlated features at once, we lose the information that both A and B had and that C and D were missing.
I think doing it iteratively does not cost much performance, because you already have the precomputed correlation matrix for the entire dataset and you only remove columns and rows from it. It also gives flexibility in case the user wants to select which feature to remove when two are correlated.
An example of this: the business prefers to work with features that they understand better, so from the two correlated ones they would choose whichever suits them best.
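To make the difference concrete, here is a rough sketch that compares one-go removal with iterative removal on the correlations listed above. The function names are hypothetical, and the `choose` callback is only a stand-in for the random or business-driven choice. Note that with the numbers as listed, B-C also sits at 0.95, so the iterative result depends on which feature of a pair is dropped.

```python
from itertools import combinations

# Pairwise correlations from the example, computed once up front.
CORR = {
    frozenset("AB"): 0.95, frozenset("AC"): 0.90, frozenset("AD"): 0.80,
    frozenset("BC"): 0.95, frozenset("BD"): 0.80, frozenset("CD"): 0.80,
}


def pairs_above(features, threshold):
    """Return the pairs of remaining features correlated at or above threshold."""
    return [(a, b) for a, b in combinations(features, 2)
            if CORR[frozenset((a, b))] >= threshold]


def remove_in_one_go(features, threshold):
    """Drop every feature that appears in any above-threshold pair."""
    flagged = {f for pair in pairs_above(features, threshold) for f in pair}
    return [f for f in features if f not in flagged]


def remove_iteratively(features, threshold, choose=lambda a, b: b):
    """Drop one feature per step; the correlations are looked up, never recomputed."""
    remaining = list(features)
    while True:
        pairs = pairs_above(remaining, threshold)
        if not pairs:
            return remaining
        a, b = pairs[0]
        remaining.remove(choose(a, b))  # caller decides which one goes


features = ["A", "B", "C", "D"]
print(remove_in_one_go(features, 0.95))
print(remove_iteratively(features, 0.95))
```

With this data, one-go removal keeps only D, while the iterative version (dropping B from the first flagged pair) keeps A, C and D. Passing a different `choose`, for example `lambda a, b: a`, shows how the user's preference changes the outcome.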
I agree, let's start with Pearson correlation and numeric features. I would propose to also do this iteratively, and for now just randomly select one of the two. Later we can add other ways of selecting. Please make a class in feature_elimination that follows the API of the other classes, e.g. ShapRFECV: init, fit, fit_compute, compute, plot. The parameter names should also match where they overlap. You can also add docstrings already.
Once you make a PR we can discuss there what other steps are needed e.g.
- code example in docstring
- unit tests
- docs notebook
from probatus.
Ok, I see the point. I thought you were recomputing the correlation after every step, which got me confused :) Can you assign the issue to me?
from probatus.
If I understand correctly, feature-engine provides similar functionality.
Do we still find this suggestion important or shall we close it, since it has been inactive for so long?
from probatus.
This issue is closed, as feature-engine provides this functionality.
from probatus.