
Comments (3)

gykovacs commented on July 4, 2024

Hi @VadimKufenko, thank you for raising the issue, I will look into it in the next couple of days!


VadimKufenko commented on July 4, 2024

Dear György @gykovacs, thank you so much for following up! This is very kind of you! I would like to express my fascination with the smote_variants package - tremendous work!

Meanwhile I tried different arrangements of the parameters, since for lists and sets one gets the following errors:

ValueError: n_estimators must be an integer, got <class 'set'>
ValueError: n_estimators must be an integer, got <class 'list'>
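For reference, the grids that raise these errors look roughly like this (an illustrative sketch of the kind of thing I tried, with the candidate values placed directly into the classifier's parameter dict, not my exact code):

# Illustrative sketch of the failing attempt: putting a list (or set) of
# candidate values directly into the classifier's parameter dict means that
# RandomForestClassifier receives the whole collection as n_estimators at
# fit time, hence the ValueError above.
param_grid_failing = {'oversamplingclassifier__classifier':
    [('sklearn.ensemble', 'RandomForestClassifier',
      {'n_estimators': [50, 1000, 10]})]}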

I saw that set_params(self, **parameters) in the OversamplingClassifier uses dictionaries, but in the end the values must pin down to integers, so I tried to improvise further. Please see my examples with generated data below - perhaps they help to identify the issue. With grid search, depending on the arrangement, either (i) the top parameters are optimized and the rest are ignored, or (ii) the last ones are listed as the optimal ones, although they cannot be optimal.

Please note that I am using imblearn.pipeline, but I had similar issues with the sklearn pipeline.

Example

import numpy as np

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from imblearn.pipeline import Pipeline, make_pipeline # IMBLEARN pipeline
from sklearn.metrics import precision_score, recall_score, average_precision_score, f1_score, make_scorer
from sklearn.model_selection import RepeatedStratifiedKFold, GridSearchCV

import smote_variants as sv
from smote_variants.classifiers import OversamplingClassifier

X, y = make_classification(n_samples=2000, n_features=10, n_informative=5, n_redundant=5, n_classes=3, n_clusters_per_class=1, weights=[0.8,0.1,0.1], random_state=42)

oversampler = ('smote_variants', 'MulticlassOversampling',
{'oversampler': 'MWMOTE', 'oversampler_params': {'proportion': 1.0}})

classifier = ('sklearn.ensemble', 'RandomForestClassifier', {})

imbalanced_pipeline = make_pipeline(OversamplingClassifier(oversampler, classifier))
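As a quick sanity check on the parameter names that GridSearchCV can address (standard scikit-learn API, shown here only for illustration):

# List the parameter names exposed by the pipeline; the keys used in the
# grids below ('oversamplingclassifier__classifier' and
# 'oversamplingclassifier__oversampler') should appear among them.
print(sorted(imbalanced_pipeline.get_params().keys()))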

Case one

param_grid_one = {'oversamplingclassifier__classifier':
[('sklearn.ensemble', 'RandomForestClassifier', {'n_estimators': 50}),
('sklearn.ensemble', 'RandomForestClassifier', {'n_estimators': 1000}),
('sklearn.ensemble', 'RandomForestClassifier', {'n_estimators': 10}),
('sklearn.ensemble', 'RandomForestClassifier', {'max_depth': 21}),
('sklearn.ensemble', 'RandomForestClassifier', {'max_depth': 2}),
('sklearn.ensemble', 'RandomForestClassifier', {'min_samples_split': 2}),
('sklearn.ensemble', 'RandomForestClassifier', {'min_samples_split': 30}),
('sklearn.ensemble', 'RandomForestClassifier', {'min_samples_leaf': 2}),
('sklearn.ensemble', 'RandomForestClassifier', {'min_samples_leaf': 77})
] }

scorers = {'precision': make_scorer(precision_score, average = 'macro'), 'f1_macro': make_scorer(f1_score, average = 'macro') }

k_folds=5
rskfold = RepeatedStratifiedKFold(n_splits=k_folds, n_repeats=5, random_state=0)

framework_one = GridSearchCV(imbalanced_pipeline,
param_grid=param_grid_one,
cv=rskfold,
n_jobs=-1,
refit='f1_macro',
verbose=0,
scoring=scorers)

results_one=framework_one.fit(X, y)

results_one.best_estimator_.named_steps['oversamplingclassifier']

Case two

param_grid_two = {'oversamplingclassifier__classifier':
[('sklearn.ensemble', 'RandomForestClassifier', {
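# note: a Python dict literal keeps only the last value for each repeated key,
# so this single candidate effectively contains max_depth=77, n_estimators=11,
# min_samples_split=33, min_samples_leaf=77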
'max_depth': 10, 'max_depth': 77,
'n_estimators': 50, 'n_estimators': 1000, 'n_estimators': 11,
'min_samples_split': 2, 'min_samples_split': 33,
'min_samples_leaf': 2, 'min_samples_leaf': 77,
})] }

framework_two = GridSearchCV(imbalanced_pipeline,
param_grid=param_grid_two,
cv=rskfold,
n_jobs=-1,
refit='f1_macro',
verbose=0,
scoring=scorers)
results_two=framework_two.fit(X, y)

results_two.best_estimator_.named_steps['oversamplingclassifier']

Hope this helps - let me know if you have questions on these examples! Thank you!


gykovacs commented on July 4, 2024

Hi @VadimKufenko, sorry for the delay, I'm looking into this right now. I am not sure if I understand the problem properly, although there are many tricky things here.

First of all, thanks for bringing up the issue. This area (multi-class oversampling with grid-search parameter selection) is rarely used, so there can be inconveniences that I can fix in the short term if we manage to come up with the outline of a better usage.

Four things to highlight in advance:

  1. When it comes to multi-class oversampling, explicit values of the proportion parameter are not used. The reason is that "proportion of what to what" is unanswered. The proportion parameter is needed internally and is set separately for each class, so that exactly as many samples are generated as needed to match the cardinality of the majority class. That is, given a majority class of 100 vectors and two further classes with 70 and 50 records, internally there will be one oversampling with a proportion of 100/70 and another with a proportion of 100/50 to equalize the cardinalities. Whatever is set explicitly as a proportion parameter will be overwritten internally. Therefore, in a multi-class case, grid search over the proportion parameter just executes the same thing again and again (see the sketch after this list).
  2. For a bunch of reasons, the RandomForestClassifier does not work well with SMOTE-like techniques; I mean, it works, but the performance scores are highest when oversampling is disabled (proportion=0.0). There are a bunch of reasons for this, mainly that the SMOTE sampling interferes negatively with the internal operations of random forests, namely bootstrapping and the random feature selection in the decision nodes. (I'm just working on a paper on how to resolve these issues.) Generally, I recommend using other classifiers.
  3. If the classification problem is fairly imbalanced and oversampling cannot fix the issues, it can end in F1 scores being zero (when precision or recall is 0, that is, there are no true positives). In these cases a grid search might end up selecting the first parameterization, as the score of all parameterizations is the same, 0. I think it is reasonable to work with roc_auc_score and derive some average roc_auc_score for multi-class problems (see the sketch at the end of this comment).
  4. I recognized that the interface of MulticlassOversampling does not follow the interface of all the other objects, which require a tuple of (smote_package, smote_name, smote_parameters); this should be changed in the future.
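To illustrate point 1, here is a minimal sketch of the per-class proportion logic described above (plain numpy, not the actual smote_variants internals):

import numpy as np

# Sketch of the logic described in point 1 (not the actual smote_variants
# implementation): each minority class gets its own proportion so that its
# cardinality is raised to that of the majority class, hence any explicitly
# supplied proportion value is irrelevant in the multi-class case.
y = np.array([0]*100 + [1]*70 + [2]*50)
labels, counts = np.unique(y, return_counts=True)
majority = counts.max()
proportions = {int(label): majority / count
               for label, count in zip(labels, counts) if count < majority}
print(proportions)  # approx. {1: 1.43, 2: 2.0}, i.e. 100/70 and 100/50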

With all that said, I have come up with something that I think achieves what you wanted, and it seems to work properly, i.e., it is not simply the first or last parameters that are selected from the grid: each combination seems to be evaluated properly. Note, however, that it contains an iteration over various proportion parameters, which does not make much sense as they are ignored. Let me know if this is what you wanted to achieve, and if not, what is wrong and how it should work.

from sklearn.datasets import make_classification
from imblearn.pipeline import Pipeline
from sklearn.metrics import precision_score, f1_score, make_scorer
from sklearn.model_selection import RepeatedStratifiedKFold, GridSearchCV
from sklearn.preprocessing import StandardScaler

import smote_variants as sv

X, y = make_classification(n_samples=500, n_features=10, n_informative=5, n_redundant=5, n_classes=3, n_clusters_per_class=1, weights=[0.8,0.1,0.1], random_state=42)

k_folds=5
rskfold = RepeatedStratifiedKFold(n_splits=k_folds, n_repeats=20, random_state=0)

scorers = {'precision': make_scorer(precision_score, average = 'macro'), 'f1_macro': make_scorer(f1_score, average = 'macro') }


dummy_init_oversampler = ('smote_variants', 'MulticlassOversampling', {'oversampler': 'MWMOTE', 'oversampler_params': {}})

dummy_init_classifier = ('sklearn.ensemble', 'RandomForestClassifier', {'n_estimators':50, 'max_depth': 3, 'min_samples_split': 2})

model= Pipeline([('scale', StandardScaler()), 
                 ('clf', sv.classifiers.OversamplingClassifier(dummy_init_oversampler, dummy_init_classifier))])

param_grid= {
'clf__oversampler':[('smote_variants', 'MulticlassOversampling', {'oversampler': 'SMOTE', 'oversampler_params': {'proportion': 0.0}}),
                    ('smote_variants', 'MulticlassOversampling', {'oversampler': 'SMOTE', 'oversampler_params': {'proportion': 0.25}}),
                    ('smote_variants', 'MulticlassOversampling', {'oversampler': 'SMOTE', 'oversampler_params': {'proportion': 0.5}}),
                    ('smote_variants', 'MulticlassOversampling', {'oversampler': 'SMOTE', 'oversampler_params': {'proportion': 0.75}}),
                    ('smote_variants', 'MulticlassOversampling', {'oversampler': 'SMOTE', 'oversampler_params': {'proportion': 1.0}})],
'clf__classifier':[('sklearn.tree', 'DecisionTreeClassifier', {'min_samples_leaf': 1}),
                    ('sklearn.tree', 'DecisionTreeClassifier', {'min_samples_leaf': 2}),
                    ('sklearn.tree', 'DecisionTreeClassifier', {'min_samples_leaf': 10}),
                    ('sklearn.tree', 'DecisionTreeClassifier', {'min_samples_leaf': 15}),
                    ('sklearn.tree', 'DecisionTreeClassifier', {'min_samples_leaf': 20}),
                    ('sklearn.tree', 'DecisionTreeClassifier', {'min_samples_leaf': 25})
] }


framework = GridSearchCV(model,
                            param_grid=param_grid,
                            cv=rskfold,
                            n_jobs=-1,
                            refit='f1_macro',
                            verbose=0,
                            scoring=scorers)

results=framework.fit(X, y)

results.best_estimator_.named_steps['clf']

This code ends with this:

OversamplingClassifier(classifier=('sklearn.tree', 'DecisionTreeClassifier',
                                   {'min_samples_leaf': 2}),
                       oversampler=('smote_variants', 'MulticlassOversampling',
                                    {'oversampler': 'SMOTE',
                                     'oversampler_params': {'proportion': 0.5}}))
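
Regarding point 3 above: if you want to select models by ROC AUC instead of F1, a sketch using scikit-learn's built-in one-vs-rest multi-class scorer could look like this (reusing model, param_grid and rskfold from the example above, and assuming predict_proba is available through the pipeline):

# Sketch: model selection with the built-in 'roc_auc_ovr' scorer
# (macro-averaged one-vs-rest ROC AUC); requires predict_proba on the
# final pipeline step.
framework_auc = GridSearchCV(model,
                             param_grid=param_grid,
                             cv=rskfold,
                             n_jobs=-1,
                             scoring='roc_auc_ovr',
                             refit=True,
                             verbose=0)
# results_auc = framework_auc.fit(X, y)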

