Comments (3)
Hi @VadimKufenko, thank you for raising the issue; I will look into it in the next couple of days!
Dear György @gykovacs, thank you so much for following up! This is very kind of you! I would like to express my fascination with the smote_variants package - a tremendous piece of work!
Meanwhile, I tried different arrangements of the parameters, since for lists and sets one gets the following errors:
ValueError: n_estimators must be an integer, got <class 'set'>
ValueError: n_estimators must be an integer, got <class 'list'>
I saw that set_params(self, **parameters) in the OversamplingClassifier uses dictionaries, but in the end the individual values still have to be plain integers (e.g. n_estimators), so I tried to improvise further. Please see my examples with generated data below - perhaps they help to identify the issue. With grid search, depending on the arrangement, either (i) the top parameters are optimized and the rest are ignored, or (ii) the last ones are reported as optimal, although they cannot be optimal.
Please note that I am using imblearn.pipeline, but I had similar issues with the sklearn pipeline.
Example
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from imblearn.pipeline import Pipeline, make_pipeline  # IMBLEARN pipeline
from sklearn.metrics import precision_score, recall_score, average_precision_score, f1_score, make_scorer
from sklearn.model_selection import RepeatedStratifiedKFold, GridSearchCV
import smote_variants as sv
from smote_variants.classifiers import OversamplingClassifier

# imbalanced three-class toy data
X, y = make_classification(n_samples=2000, n_features=10, n_informative=5, n_redundant=5,
                           n_classes=3, n_clusters_per_class=1, weights=[0.8, 0.1, 0.1], random_state=42)

# the oversampler and the classifier are given as (module, class_name, params) tuples
oversampler = ('smote_variants', 'MulticlassOversampling',
               {'oversampler': 'MWMOTE', 'oversampler_params': {'proportion': 1.0}})
classifier = ('sklearn.ensemble', 'RandomForestClassifier', {})

# make_pipeline names the step after the lowercased class name,
# hence the 'oversamplingclassifier__...' keys in the grids below
imbalanced_pipeline = make_pipeline(OversamplingClassifier(oversampler, classifier))
Case one
# each tuple is one complete candidate value for the 'classifier' parameter
param_grid_one = {'oversamplingclassifier__classifier': [
    ('sklearn.ensemble', 'RandomForestClassifier', {'n_estimators': 50}),
    ('sklearn.ensemble', 'RandomForestClassifier', {'n_estimators': 1000}),
    ('sklearn.ensemble', 'RandomForestClassifier', {'n_estimators': 10}),
    ('sklearn.ensemble', 'RandomForestClassifier', {'max_depth': 21}),
    ('sklearn.ensemble', 'RandomForestClassifier', {'max_depth': 2}),
    ('sklearn.ensemble', 'RandomForestClassifier', {'min_samples_split': 2}),
    ('sklearn.ensemble', 'RandomForestClassifier', {'min_samples_split': 30}),
    ('sklearn.ensemble', 'RandomForestClassifier', {'min_samples_leaf': 2}),
    ('sklearn.ensemble', 'RandomForestClassifier', {'min_samples_leaf': 77})
]}
scorers = {'precision': make_scorer(precision_score, average = 'macro'), 'f1_macro': make_scorer(f1_score, average = 'macro') }
k_folds=5
rskfold = RepeatedStratifiedKFold(n_splits=k_folds, n_repeats=5, random_state=0)
framework_one = GridSearchCV(imbalanced_pipeline,
                             param_grid=param_grid_one,
                             cv=rskfold,
                             n_jobs=-1,
                             refit='f1_macro',
                             verbose=0,
                             scoring=scorers)
results_one = framework_one.fit(X, y)
results_one.best_estimator_.named_steps['oversamplingclassifier']
Case two
param_grid_two = {'oversamplingclassifier__classifier': [
    ('sklearn.ensemble', 'RandomForestClassifier', {
        'max_depth': 10, 'max_depth': 77,
        'n_estimators': 50, 'n_estimators': 1000, 'n_estimators': 11,
        'min_samples_split': 2, 'min_samples_split': 33,
        'min_samples_leaf': 2, 'min_samples_leaf': 77,
    })
]}
framework_two = GridSearchCV(imbalanced_pipeline,
                             param_grid=param_grid_two,
                             cv=rskfold,
                             n_jobs=-1,
                             refit='f1_macro',
                             verbose=0,
                             scoring=scorers)
results_two = framework_two.fit(X, y)
results_two.best_estimator_.named_steps['oversamplingclassifier']
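For what it's worth, a plain Python check suggests why only the last values can ever show up as "optimal" in case two: duplicate keys in a dict literal are silently collapsed, keeping only the last value for each key.

# duplicate keys in a dict literal collapse to the last value given
params = {'n_estimators': 50, 'n_estimators': 1000, 'n_estimators': 11}
print(params)  # {'n_estimators': 11} - only the last value survives

If the intention is to cross several hyperparameters with this tuple-based interface, the only workaround I can think of is to enumerate each full (module, class, params) tuple explicitly, for example (just a sketch, the parameter values are chosen for illustration):

from itertools import product
rf_candidates = [('sklearn.ensemble', 'RandomForestClassifier',
                  {'n_estimators': n, 'max_depth': d})
                 for n, d in product([50, 1000], [2, 21])]
param_grid_enumerated = {'oversamplingclassifier__classifier': rf_candidates}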
Hope this helps - let me know if you have questions on these examples! Thank you!
Hi @VadimKufenko, sorry for the delay, I'm looking into this right now. I am not sure whether I understand the problem properly, as there are many tricky things here.
First of all, thanks for bringing up the issue. This area (multi-class oversampling with grid parameter selection) is rarely used, so there may be inconveniences that I can fix in the short term if we manage to come up with the outline of a better usage.
Four things to highlight in advance:
- When it comes to multi-class oversampling, explicit values of the proportion parameter are not used. The reason is that "proportion of what to what" has no single answer. The proportion parameter is needed internally: it is set separately for each class so that exactly as many samples are generated as needed to match the cardinality of the majority class. That is, given a majority class of 100 vectors and two further classes with 70 and 50 records, internally there will be one oversampling with a proportion of 100/70 and another with a proportion of 100/50 to equalize the cardinalities. Whatever is set explicitly as a proportion parameter will be overwritten internally. Therefore, in the multi-class case, a grid search over the proportion parameter executes the same thing behind the scenes again and again (see the sketch after this list).
- For a bunch of reasons, RandomForestClassifier does not work well with SMOTE-like techniques - I mean, it works, but the performance scores are highest when oversampling is disabled (proportion=0.0). The main reason is that SMOTE sampling interferes negatively with the internal operations of random forests, namely bootstrapping and the random feature selection in the decision nodes. (I'm currently working on a paper on how to resolve these issues.) Generally, I recommend using other classifiers.
- If the classification problem is heavily imbalanced and oversampling cannot fix it, F1 scores can end up being zero (when precision or recall is 0, i.e., there are no true positives). In these cases a grid search may end up selecting the first parameterization, as the score of every parameterization is the same, 0. I think it is reasonable to work with roc_auc_score and derive some averaged roc_auc_score for multi-class problems.
- I recognize that the interface of MulticlassOversampling does not follow that of all the other objects, which take a tuple of (smote_package, smote_name, smote_parameters); this should be changed in the future.
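To make the per-class proportions above concrete, here is a rough sketch of the arithmetic (this only illustrates the description above, it is not the internal smote_variants code):

from collections import Counter

def per_class_proportions(y):
    counts = Counter(y)                  # e.g. {0: 100, 1: 70, 2: 50}
    majority = max(counts.values())      # 100
    # one oversampling per minority class, proportion = majority count / class count
    return {label: majority / count
            for label, count in counts.items() if count < majority}

print(per_class_proportions([0] * 100 + [1] * 70 + [2] * 50))
# {1: 1.4285714285714286, 2: 2.0}, i.e. 100/70 and 100/50

Any proportion supplied by the user is replaced by these internally computed values, which is why grid-searching over it has no effect in the multi-class case.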
With all that said, I have come up with something that I think achieves what you wanted, and it seems to work properly, i.e., it is not simply the first or the last parameters that get selected from the grid: each combination seems to be evaluated properly. Note, however, that it contains an iteration over various proportion parameters, which does not make much sense, as they are ignored. Let me know if this is what you wanted to achieve, and if not, what is wrong and how it should work.
from sklearn.datasets import make_classification
from imblearn.pipeline import Pipeline
from sklearn.metrics import precision_score, f1_score, make_scorer
from sklearn.model_selection import RepeatedStratifiedKFold, GridSearchCV
from sklearn.preprocessing import StandardScaler
import smote_variants as sv
X, y = make_classification(n_samples=500, n_features=10, n_informative=5, n_redundant=5, n_classes=3, n_clusters_per_class=1, weights=[0.8,0.1,0.1], random_state=42)
k_folds=5
rskfold = RepeatedStratifiedKFold(n_splits=k_folds, n_repeats=20, random_state=0)
scorers = {'precision': make_scorer(precision_score, average = 'macro'), 'f1_macro': make_scorer(f1_score, average = 'macro') }
dummy_init_oversampler = ('smote_variants', 'MulticlassOversampling', {'oversampler': 'MWMOTE', 'oversampler_params': {}})
dummy_init_classifier = ('sklearn.ensemble', 'RandomForestClassifier', {'n_estimators':50, 'max_depth': 3, 'min_samples_split': 2})
model = Pipeline([('scale', StandardScaler()),
                  ('clf', sv.classifiers.OversamplingClassifier(dummy_init_oversampler, dummy_init_classifier))])
param_grid = {
    'clf__oversampler': [('smote_variants', 'MulticlassOversampling', {'oversampler': 'SMOTE', 'oversampler_params': {'proportion': 0.0}}),
                         ('smote_variants', 'MulticlassOversampling', {'oversampler': 'SMOTE', 'oversampler_params': {'proportion': 0.25}}),
                         ('smote_variants', 'MulticlassOversampling', {'oversampler': 'SMOTE', 'oversampler_params': {'proportion': 0.5}}),
                         ('smote_variants', 'MulticlassOversampling', {'oversampler': 'SMOTE', 'oversampler_params': {'proportion': 0.75}}),
                         ('smote_variants', 'MulticlassOversampling', {'oversampler': 'SMOTE', 'oversampler_params': {'proportion': 1.0}})],
    'clf__classifier': [('sklearn.tree', 'DecisionTreeClassifier', {'min_samples_leaf': 1}),
                        ('sklearn.tree', 'DecisionTreeClassifier', {'min_samples_leaf': 2}),
                        ('sklearn.tree', 'DecisionTreeClassifier', {'min_samples_leaf': 10}),
                        ('sklearn.tree', 'DecisionTreeClassifier', {'min_samples_leaf': 15}),
                        ('sklearn.tree', 'DecisionTreeClassifier', {'min_samples_leaf': 20}),
                        ('sklearn.tree', 'DecisionTreeClassifier', {'min_samples_leaf': 25})]
}
framework = GridSearchCV(model,
                         param_grid=param_grid,
                         cv=rskfold,
                         n_jobs=-1,
                         refit='f1_macro',
                         verbose=0,
                         scoring=scorers)
results = framework.fit(X, y)
results.best_estimator_.named_steps['clf']
This code ends with the following best estimator:
OversamplingClassifier(classifier=('sklearn.tree', 'DecisionTreeClassifier',
{'min_samples_leaf': 2}),
oversampler=('smote_variants', 'MulticlassOversampling',
{'oversampler': 'SMOTE',
'oversampler_params': {'proportion': 0.5}}))
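If you want to double-check that every grid combination was actually scored (rather than just the first or the last one), the standard cv_results_ attribute of the fitted GridSearchCV can be inspected, for instance like this (a quick sketch, assuming pandas is available):

import pandas as pd

cv_df = pd.DataFrame(results.cv_results_)
# one row per (oversampler, classifier) combination with its mean score and rank
print(cv_df[['param_clf__oversampler', 'param_clf__classifier',
             'mean_test_f1_macro', 'rank_test_f1_macro']])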