
Comments (3)

gykovacs commented on July 4, 2024

Hi @VadimKufenko, thank you for raising the issue, I will look into it in the next couple of days!


VadimKufenko commented on July 4, 2024

Dear György @gykovacs, thank you so much for following up! This is very kind of you! I would like to express my fascination with the smote_variants package - tremendous work!

Meanwhile I tried different arrangements of the parameters, since for lists and sets one gets the following errors:

ValueError: n_estimators must be an integer, got <class 'set'>
ValueError: n_estimators must be an integer, got <class 'list'>
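For reference, the grids that raise these errors look roughly like this (an illustrative sketch of the kind of thing I tried, with the candidate values placed directly into the classifier's parameter dict, not my exact code):

# Illustrative sketch of the failing attempt: putting a list (or set) of
# candidate values directly into the classifier's parameter dict means that
# RandomForestClassifier receives the whole collection as n_estimators at
# fit time, hence the ValueError above.
param_grid_failing = {'oversamplingclassifier__classifier':
    [('sklearn.ensemble', 'RandomForestClassifier',
      {'n_estimators': [50, 1000, 10]})]}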

I saw that set_params(self, **parameters) in the OversamplingClassifier uses dictionaries, but in the end the values must pin down to integers, so I tried to improvise further. Please see my examples with generated data below - perhaps they help to identify the issue. With grid search, depending on the arrangement, either (i) the top parameters are optimized and the rest are ignored, or (ii) the last ones are listed as the optimal ones, although they cannot be optimal.

Please note that I am using imblearn.pipeline, but I had similar issues with the sklearn pipeline.

Example

import numpy as np

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from imblearn.pipeline import Pipeline, make_pipeline # IMBLEARN pipeline
from sklearn.metrics import precision_score, recall_score, average_precision_score, f1_score, make_scorer
from sklearn.model_selection import RepeatedStratifiedKFold, GridSearchCV

import smote_variants as sv
from smote_variants.classifiers import OversamplingClassifier

X, y = make_classification(n_samples=2000, n_features=10, n_informative=5, n_redundant=5, n_classes=3, n_clusters_per_class=1, weights=[0.8,0.1,0.1], random_state=42)

oversampler = ('smote_variants', 'MulticlassOversampling',
{'oversampler': 'MWMOTE', 'oversampler_params': {'proportion': 1.0}})

classifier = ('sklearn.ensemble', 'RandomForestClassifier', {})

imbalanced_pipeline = make_pipeline(OversamplingClassifier(oversampler, classifier))
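As a quick sanity check on the parameter names that GridSearchCV can address (standard scikit-learn API, shown here only for illustration):

# List the parameter names exposed by the pipeline; the keys used in the
# grids below ('oversamplingclassifier__classifier' and
# 'oversamplingclassifier__oversampler') should appear among them.
print(sorted(imbalanced_pipeline.get_params().keys()))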

Case one

param_grid_one = {'oversamplingclassifier__classifier':
[('sklearn.ensemble', 'RandomForestClassifier', {'n_estimators': 50}),
('sklearn.ensemble', 'RandomForestClassifier', {'n_estimators': 1000}),
('sklearn.ensemble', 'RandomForestClassifier', {'n_estimators': 10}),
('sklearn.ensemble', 'RandomForestClassifier', {'max_depth': 21}),
('sklearn.ensemble', 'RandomForestClassifier', {'max_depth': 2}),
('sklearn.ensemble', 'RandomForestClassifier', {'min_samples_split': 2}),
('sklearn.ensemble', 'RandomForestClassifier', {'min_samples_split': 30}),
('sklearn.ensemble', 'RandomForestClassifier', {'min_samples_leaf': 2}),
('sklearn.ensemble', 'RandomForestClassifier', {'min_samples_leaf': 77})
] }

scorers = {'precision': make_scorer(precision_score, average = 'macro'), 'f1_macro': make_scorer(f1_score, average = 'macro') }

k_folds=5
rskfold = RepeatedStratifiedKFold(n_splits=k_folds, n_repeats=5, random_state=0)

framework_one = GridSearchCV(imbalanced_pipeline,
param_grid=param_grid_one,
cv=rskfold,
n_jobs=-1,
refit='f1_macro',
verbose=0,
scoring=scorers)

results_one=framework_one.fit(X, y)

results_one.best_estimator_.named_steps['oversamplingclassifier']

Case two

param_grid_two = {'oversamplingclassifier__classifier':
[('sklearn.ensemble', 'RandomForestClassifier', {
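# note: a Python dict literal keeps only the last value for each repeated key,
# so this single candidate effectively contains max_depth=77, n_estimators=11,
# min_samples_split=33, min_samples_leaf=77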
'max_depth': 10, 'max_depth': 77,
'n_estimators': 50, 'n_estimators': 1000, 'n_estimators': 11,
'min_samples_split': 2, 'min_samples_split': 33,
'min_samples_leaf': 2, 'min_samples_leaf': 77,
})] }

framework_two = GridSearchCV(imbalanced_pipeline,
param_grid=param_grid_two,
cv=rskfold,
n_jobs=-1,
refit='f1_macro',
verbose=0,
scoring=scorers)
results_two=framework_two.fit(X, y)

results_two.best_estimator_.named_steps['oversamplingclassifier']

Hope this helps - let me know if you have questions on these examples! Thank you!


gykovacs commented on July 4, 2024

Hi @VadimKufenko, sorry for the delay, I'm looking into this right now. I am not sure if I understand the problem properly, although there are many tricky things here.

First of all, thanks for bringing up the issue. This area (multi-class oversampling with grid-search parameter selection) is rarely used, so there can be inconveniences that I can fix in the short term if we manage to come up with the outline of a better usage.

Four things to highlight in advance:

  1. When it comes to multi-class oversampling, explicit values of the proportion parameter are not used. The reason is that "proportion of what to what" is unanswered. The proportion parameter is needed internally and is set separately for each class, so that exactly as many samples are generated as needed to match the cardinality of the majority class. That is, given a majority class of 100 vectors and two further classes with 70 and 50 records, internally there will be one oversampling with a proportion of 100/70 and another with a proportion of 100/50 to equalize the cardinalities. Whatever is set explicitly as a proportion parameter will be overwritten internally. Therefore, in a multi-class case, grid search over the proportion parameter just executes the same thing again and again (see the sketch after this list).
  2. For a bunch of reasons, the RandomForestClassifier does not work well with SMOTE-like techniques; I mean, it works, but the performance scores are highest when oversampling is disabled (proportion=0.0). There are a bunch of reasons for this, mainly that the SMOTE sampling interferes negatively with the internal operations of random forests, namely bootstrapping and the random feature selection in the decision nodes. (I'm just working on a paper on how to resolve these issues.) Generally, I recommend using other classifiers.
  3. If the classification problem is fairly imbalanced and oversampling cannot fix the issues, it can end in F1 scores being zero (when precision or recall is 0, that is, there are no true positives). In these cases a grid search might end up selecting the first parameterization, as the score of all parameterizations is the same, 0. I think it is reasonable to work with roc_auc_score and derive some average roc_auc_score for multi-class problems (see the sketch at the end of this comment).
  4. I recognized that the interface of MulticlassOversampling does not follow the interface of all the other objects, which require a tuple of (smote_package, smote_name, smote_parameters); this should be changed in the future.
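To illustrate point 1, here is a minimal sketch of the per-class proportion logic described above (plain numpy, not the actual smote_variants internals):

import numpy as np

# Sketch of the logic described in point 1 (not the actual smote_variants
# implementation): each minority class gets its own proportion so that its
# cardinality is raised to that of the majority class, hence any explicitly
# supplied proportion value is irrelevant in the multi-class case.
y = np.array([0]*100 + [1]*70 + [2]*50)
labels, counts = np.unique(y, return_counts=True)
majority = counts.max()
proportions = {int(label): majority / count
               for label, count in zip(labels, counts) if count < majority}
print(proportions)  # approx. {1: 1.43, 2: 2.0}, i.e. 100/70 and 100/50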

With all that said, I have come up with something that I think achieves what you wanted, and it seems to work properly, i.e., it is not simply the first or last parameters that are selected from the grid: each combination seems to be evaluated properly. Note, however, that it contains an iteration over various proportion parameters, which does not make much sense as they are ignored. Let me know if this is what you wanted to achieve, and if not, what is wrong and how it should work.

from sklearn.datasets import make_classification
from imblearn.pipeline import Pipeline
from sklearn.metrics import precision_score, f1_score, make_scorer
from sklearn.model_selection import RepeatedStratifiedKFold, GridSearchCV
from sklearn.preprocessing import StandardScaler

import smote_variants as sv

X, y = make_classification(n_samples=500, n_features=10, n_informative=5, n_redundant=5, n_classes=3, n_clusters_per_class=1, weights=[0.8,0.1,0.1], random_state=42)

k_folds=5
rskfold = RepeatedStratifiedKFold(n_splits=k_folds, n_repeats=20, random_state=0)

scorers = {'precision': make_scorer(precision_score, average = 'macro'), 'f1_macro': make_scorer(f1_score, average = 'macro') }


dummy_init_oversampler = ('smote_variants', 'MulticlassOversampling', {'oversampler': 'MWMOTE', 'oversampler_params': {}})

dummy_init_classifier = ('sklearn.ensemble', 'RandomForestClassifier', {'n_estimators':50, 'max_depth': 3, 'min_samples_split': 2})

model= Pipeline([('scale', StandardScaler()), 
                 ('clf', sv.classifiers.OversamplingClassifier(dummy_init_oversampler, dummy_init_classifier))])

param_grid= {
'clf__oversampler':[('smote_variants', 'MulticlassOversampling', {'oversampler': 'SMOTE', 'oversampler_params': {'proportion': 0.0}}),
                    ('smote_variants', 'MulticlassOversampling', {'oversampler': 'SMOTE', 'oversampler_params': {'proportion': 0.25}}),
                    ('smote_variants', 'MulticlassOversampling', {'oversampler': 'SMOTE', 'oversampler_params': {'proportion': 0.5}}),
                    ('smote_variants', 'MulticlassOversampling', {'oversampler': 'SMOTE', 'oversampler_params': {'proportion': 0.75}}),
                    ('smote_variants', 'MulticlassOversampling', {'oversampler': 'SMOTE', 'oversampler_params': {'proportion': 1.0}})],
'clf__classifier':[('sklearn.tree', 'DecisionTreeClassifier', {'min_samples_leaf': 1}),
                    ('sklearn.tree', 'DecisionTreeClassifier', {'min_samples_leaf': 2}),
                    ('sklearn.tree', 'DecisionTreeClassifier', {'min_samples_leaf': 10}),
                    ('sklearn.tree', 'DecisionTreeClassifier', {'min_samples_leaf': 15}),
                    ('sklearn.tree', 'DecisionTreeClassifier', {'min_samples_leaf': 20}),
                    ('sklearn.tree', 'DecisionTreeClassifier', {'min_samples_leaf': 25})
] }


framework = GridSearchCV(model,
                            param_grid=param_grid,
                            cv=rskfold,
                            n_jobs=-1,
                            refit='f1_macro',
                            verbose=0,
                            scoring=scorers)

results=framework.fit(X, y)

results.best_estimator_.named_steps['clf']

This code ends with this:

OversamplingClassifier(classifier=('sklearn.tree', 'DecisionTreeClassifier',
                                   {'min_samples_leaf': 2}),
                       oversampler=('smote_variants', 'MulticlassOversampling',
                                    {'oversampler': 'SMOTE',
                                     'oversampler_params': {'proportion': 0.5}}))
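
Regarding point 3 above: if you want to select models by ROC AUC instead of F1, a sketch using scikit-learn's built-in one-vs-rest multi-class scorer could look like this (reusing model, param_grid and rskfold from the example above, and assuming predict_proba is available through the pipeline):

# Sketch: model selection with the built-in 'roc_auc_ovr' scorer
# (macro-averaged one-vs-rest ROC AUC); requires predict_proba on the
# final pipeline step.
framework_auc = GridSearchCV(model,
                             param_grid=param_grid,
                             cv=rskfold,
                             n_jobs=-1,
                             scoring='roc_auc_ovr',
                             refit=True,
                             verbose=0)
# results_auc = framework_auc.fit(X, y)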

