Comments (3)
Hi @arjunpuri7,
in my impression, 20 billion instances with 1500 features each (30 trillion numbers altogether, roughly 120 terabytes) is far beyond the capabilities of sklearn-related techniques. partial_fit could be used, but as a matter of fact, smote_variants is not prepared for this volume of data. Imbalanced datasets are usually much smaller, and SMOTE techniques were developed for these relatively small datasets.
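For reference, the size estimate can be reproduced in a few lines. This is only a back-of-the-envelope sketch: the 4 bytes per value (float32) is an assumption chosen because it matches the 120 terabyte figure, and the helper name `dataset_size_bytes` is made up for illustration.

```python
def dataset_size_bytes(n_rows, n_features, bytes_per_value=4):
    """Raw size of a dense numeric table; 4 bytes per value assumes float32."""
    return n_rows * n_features * bytes_per_value


n_rows = 20_000_000_000   # 20 billion instances
n_features = 1500

n_values = n_rows * n_features
size_tb = dataset_size_bytes(n_rows, n_features) / 1e12  # decimal terabytes

print(f"{n_values:,} values, about {size_tb:.0f} TB")
```

With 8-byte float64 values, which is the default dtype in NumPy, the same table would be twice as large.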
What is the imbalance ratio (#negative/#positive) in your dataset? I would guess that many of your records are redundant and do not add much information to the classification process. Subsampling would make the data easier to handle without a significant loss of information.
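A minimal sketch of both ideas, assuming binary labels where 1 marks the positive (minority) class. The helpers `imbalance_ratio` and `undersample_majority` are hypothetical names and are not part of smote_variants; the code uses only the standard library and plain random undersampling, which may or may not be the kind of subsampling best suited to the actual data.

```python
import random
from collections import Counter


def imbalance_ratio(labels, positive=1):
    """Return #negative / #positive."""
    n_pos = Counter(labels)[positive]
    if n_pos == 0:
        raise ValueError("no positive samples")
    return (len(labels) - n_pos) / n_pos


def undersample_majority(rows, labels, target_ratio=1.0, positive=1, seed=0):
    """Keep all positives and a random subset of negatives so that
    #negative / #positive is approximately target_ratio."""
    rng = random.Random(seed)
    pos_idx = [i for i, y in enumerate(labels) if y == positive]
    neg_idx = [i for i, y in enumerate(labels) if y != positive]
    n_keep = min(len(neg_idx), int(round(target_ratio * len(pos_idx))))
    kept = sorted(pos_idx + rng.sample(neg_idx, n_keep))
    return [rows[i] for i in kept], [labels[i] for i in kept]


# Toy data: 10 positives and 990 negatives.
labels = [1] * 10 + [0] * 990
rows = list(range(len(labels)))

ratio_before = imbalance_ratio(labels)
sub_rows, sub_labels = undersample_majority(rows, labels, target_ratio=5.0)
ratio_after = imbalance_ratio(sub_labels)
print(ratio_before, ratio_after, len(sub_rows))
```

After this step the data is still imbalanced, only far smaller, so an oversampler could then be applied to the reduced set if needed.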
from smote_variants.
Sir,
I am trying to work with the dask library and want to use smote_variants. The data is about drugs, and I am trying to deal with its imbalance ratio. The whole dataset cannot be loaded into memory at once, so I am loading it as a dask dataframe and want to apply smote_variants to small chunks of the main dataset. If I reduce the number of instances in my dataset, it will affect my study. Please help me out.
Hi @arjunpuri7, I hope you managed to overcome the problem. Personally, I do not think it is meaningful to apply oversampling to such a huge amount of data; I think some reliable downsampling is what you need. Can we close this issue?
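One possible way to downsample data that cannot be loaded at once is reservoir sampling over chunks. The sketch below is an illustration under assumptions, not something smote_variants provides: `reservoir_downsample` and `demo_chunks` are hypothetical names, each chunk is assumed to be an iterable of `(features, label)` pairs, and label 1 is assumed to be the minority class. In practice the chunks could come from dask partitions or any other chunked reader.

```python
import random


def reservoir_downsample(chunks, k, positive=1, seed=0):
    """Keep every minority row plus a uniform random sample of k majority
    rows, reading the data one chunk at a time (reservoir sampling)."""
    rng = random.Random(seed)
    minority, reservoir = [], []
    seen = 0  # number of majority rows seen so far
    for chunk in chunks:
        for features, label in chunk:
            if label == positive:
                minority.append((features, label))
                continue
            seen += 1
            if len(reservoir) < k:
                reservoir.append((features, label))
            else:
                j = rng.randrange(seen)  # uniform integer in [0, seen)
                if j < k:
                    reservoir[j] = (features, label)
    return minority + reservoir


def demo_chunks(n_chunks=10, chunk_size=100):
    """Toy stand-in for a chunked reader: one positive row per chunk."""
    for c in range(n_chunks):
        yield [((c, i), 1 if i == 0 else 0) for i in range(chunk_size)]


sample = reservoir_downsample(demo_chunks(), k=20)
n_pos = sum(label for _, label in sample)
print(len(sample), n_pos)
```

Only the minority rows and the k-row reservoir are held in memory at any time, so the memory cost does not grow with the number of majority rows.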