Comments (10)
Hi @glemaitre, absolutely, I had been planning to contact you about this for a long time, but you were faster.
I think imbalanced-learn is a fairly mature package, so we definitely shouldn't make smote-variants a dependency of imbalanced-learn; rather, we should select some techniques and translate the code or reimplement the methods following the very high quality standards of imbalanced-learn. In my benchmarking, I arrived at 6 methods which finish in the top 3 places on various types of datasets, so I think these 6 should prove useful in a range of applications: polynom-fit-SMOTE, ProWSyn, SMOTE-IPF, Lee, SMOBD, G-SMOTE. Alternatively, shooting for the top 3, we could go for polynom-fit-SMOTE, ProWSyn and SMOTE-IPF.
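All of these variants build on the core SMOTE idea of generating synthetic minority samples by interpolating between a minority point and one of its minority-class neighbours. A stdlib-only sketch of that single interpolation step (purely illustrative; this is plain SMOTE's step, not any of the named variants, and the function name is my own):

```python
import random

def smote_interpolate(sample, neighbor, rng=random):
    """Generate one synthetic point on the segment between a minority
    sample and one of its minority-class neighbours (core SMOTE step)."""
    gamma = rng.random()  # uniform in [0, 1)
    return [s + gamma * (n - s) for s, n in zip(sample, neighbor)]

# A synthetic point always lies within the box spanned by the two parents.
random.seed(0)
parent_a = [0.0, 0.0]
parent_b = [1.0, 2.0]
synthetic = smote_interpolate(parent_a, parent_b)
```

The named variants differ mainly in how the parent pairs are chosen, how many points are generated per region, and how noisy candidates are filtered afterwards.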
I absolutely agree with benchmarking the other techniques, too; honestly, this would have been my next project in this area. I can refine and generalize the evaluation framework quickly. I think we should select the scope (the methods of interest) properly, and then we could kick off something like this very quickly.
I was also thinking about creating some sort of "super-wrapper" package that would wrap oversampling, ensemble, and cost-sensitive learning techniques behind a somewhat standardized interface, exactly for ease of benchmarking and experimentation. The benchmarking framework would fit such a super-wrapper package pretty well.
Any comments are welcome!
from smote_variants.
We are absolutely on the same page.

> I have arrived at 6 methods which finish in the top 3 places on various types of datasets

I think that this is the way to go.
On our side, I think that we can be more conservative about including new SMOTE variants. We can first implement them in smote_variants, if not already present, and use the benchmark to decide on inclusion. It will help us a lot on the documentation side, justifying the included models and the way they work. We can always refer to smote_variants for people who want to try more exotic variants.
> I absolutely agree with the benchmarking of other techniques, too, honestly, this would have been my next project in this topic. I can refine and generalize the evaluation framework quickly.

It has always been an objective of @chkoar and myself, but we have lacked the time-bandwidth lately. Reusing some existing infrastructure would be really useful.
> I was also thinking about creating some sort of a "super-wrapper" package, which would wrap oversampling, ensemble, and cost-sensitive learning techniques, providing a somewhat standardized interface, exactly for the ease of benchmarking and experimentation.

This would need to be discussed in more detail, but it could be one way to go.
Regarding cost-sensitive methods, we were thinking about including some. In a way, we thought of using imbalanced-learn 1.0.0 as the trigger to reorganise the modules to take the different approaches into account.
Great! In order to improve the benchmarking, I will try to set up some sort of fully reproducible auto-benchmarking system as a CI/CD job. I feel this would be the right way to keep the evaluation transparent and fully reproducible. I also think that, in this way, smote-variants can do a good job as an experimentation sandbox behind imblearn.
Regarding a continuous benchmark, it is really what I had in mind: scikit-learn-contrib/imbalanced-learn#646 (comment)
@chkoar is more interested in implementing all possible methods and letting the user choose. I would at first prefer to reduce the number of samplers. I would consider the first option valid only if we have a good continuous benchmark running and strong documentation referring to it.
How many resources does your benchmark require? How long does it take to run the experiment?
Well, the experiment I ran and describe in the paper took something like 3 weeks on a 32-core AWS instance, involving 85 methods with 35 different parameter settings each, 4 classifiers on top of that with 6 different parameter settings each, and repeated k-fold cross-validation with 5 splits and 3 repeats, all of that across 104 datasets.
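Multiplying out the grid described above gives a sense of why it took weeks (assuming a full factorial design, which is my reading of the setup; in practice the oversampling step is shared across classifier fits, so this is an upper-bound count of classifier fits):

```python
# Full factorial size of the experiment described above.
oversampler_configs = 85 * 35   # methods x parameter settings
classifier_configs = 4 * 6      # classifiers x parameter settings
cv_fits_per_config = 5 * 3      # 5-fold CV, 3 repeats
datasets = 104

total_fits = (oversampler_configs * classifier_configs
              * cv_fits_per_config * datasets)
print(total_fits)  # roughly 1.1e8 classifier fits
```

At that scale, even a mean fit time well under a second per model adds up to weeks of CPU time.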
EDIT:
Training the classifiers on top of the various oversampling strategies takes 80% of the time.
That's clearly too much computational work, but the majority of it was caused by 5-10 "large" datasets and 3-5 very slow, evolutionary oversampling techniques. I think that
- reducing the 35 parameter settings to, say, 15,
- reducing the classifier parameter combinations to about 3-4,
- reducing the datasets to 60-70 small ones,
- reducing the number of repeats in the repeated k-fold cross-validation,
- and setting a reasonable timeout for each method
could reduce the work to a couple of hours on a 32-64 core instance.
@glemaitre @gykovacs IMHO, the methods that we have to implement or include in imblearn and the method the user will pick are completely unrelated things. We already know that plain SMOTE will do the job. But, since we have the no-free-lunch theorem, I believe we should not worry about which method is the best to include. We could prioritize by the number of citations (I do not want to set a threshold) or something else. For me, we need a benchmark just for the timings, and we should commit to that. imblearn should have the fastest and most accurate (as described in the papers) implementations. That's my two cents.
@chkoar If we target well-described, established methods (those which appeared in highly cited journals), the number of potential techniques to include drops to about 20-30. On the other hand, in my experience, these are typically not the best performers on average - but at the same time, "average performance" is always questionable due to no free lunch.
Seemingly, the question is whether we believe the outcome of a reasonable benchmark. I think it might make sense, as the methods users look for should perform well on the "smooth" problems posed by real classification datasets, and this might be captured by a benchmark dataset collection.
One more remark from my experience: usually, less-established, simple methods were found to be robust enough to provide acceptable performance on all datasets. These are usually described in hard-to-access, super-short conference papers.
> If we target well-described, and established methods (which appeared in highly cited journals), the number of potential techniques to include will drop to about 20-30.

As I said, I didn't mean this for inclusion but for prioritization. So we would not have a bunch of methods initially, which is @glemaitre's concern, if I understood correctly.

> One more remark on my experiences: usually less-established, simple methods were found to be robust enough to provide acceptable performance on all datasets.

I totally agree. That's why I do not see a reason to include only the top methods (most cited, best performing across classifiers, etc.) in imblearn and exclude the rest. As you said, there will always be a case where a specific over-sampler performs well. If that were the criterion, the main scikit-learn package would have only 5 methods. That's my other two cents.
I did some experimentation with CircleCI; it doesn't seem suitable for automated benchmarking under the community subscription plan - it is too much of a workload even if only one relatively small dataset is used.
I have also become concerned about my earlier idea of using CI/CD for benchmarking. Instead, I can imagine a standalone benchmarking solution which can be installed on any machine, checks out packages and datasets through quasi-standard benchmarking interfaces, runs experiments where the code has changed, and publishes the results on a local web server.
Maintaining such a solution and linking it from any documentation page doesn't seem to be a burden, yet the solution is flexible and can be moved between clouds easily when needed.
I think my company could even finance an instance like this. The main difference compared to CI/CD is that it would run the benchmarking regularly, not on pull requests or other hooks.
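A hedged, stdlib-only sketch of the change-detection part of such a standalone runner (the idea of fingerprinting checked-out sources, and both function names, are my own assumptions, not an existing tool):

```python
import hashlib
from pathlib import Path

def source_fingerprint(root: str) -> str:
    """Hash all Python sources under `root`, so a benchmark run can be
    skipped when nothing has changed since the last recorded run."""
    digest = hashlib.sha256()
    for path in sorted(Path(root).rglob("*.py")):
        digest.update(path.name.encode())
        digest.update(path.read_bytes())
    return digest.hexdigest()

def needs_rerun(root: str, last_fingerprint: str) -> bool:
    """True when the checked-out sources differ from the last benchmarked state."""
    return source_fingerprint(root) != last_fingerprint
```

A periodic job would then compare the current fingerprint against the one stored with the last published results, and only re-run the affected experiments when they differ.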
Any comments are welcome! Do you have experience with, or anything particular in mind regarding, a proper benchmarking solution?
@gykovacs Would you be interested in testing your benchmarks on the newer LoRAS and ProWRAS implementations I wrote here: https://github.com/zoj613/pyloras ? I do not think they are implemented in either of the two packages.
They do seem promising, at least to my untrained eye.