Comments (9)
Hi @mkoz92 👍
This has been fixed. Please upgrade via:
pip install featurewiz --upgrade
It should work now. Please confirm and close.
Thanks
AutoViML
from featurewiz.
Hey AutoViML, after updating, the problem still persists, but it now behaves a bit differently:
Skipping feature engineering since no feature_engg input...
Skipping category encoding since no category encoders specified in input...
Loading train data...
Sampling 1000 rows from dataframe given
Since dask_xgboost_flag is True, reducing memory size and loading into dask
Caution: We will try to reduce the memory usage of dataframe from 1.69 MB
Memory usage after optimization is: 0.43 MB
decreased by 74.7%
Converted pandas dataframe into a Dask dataframe ...
Loading test data...
Since dask_xgboost_flag is True, reducing memory size and loading into dask
No file given. Continuing...
Classifying features using 1000 rows...
loading a random sample of 1000 rows into pandas for EDA
############## C L A S S I F Y I N G V A R I A B L E S ####################
Classifying variables in data set...
220 Predictors classified...
120 variable(s) will be ignored since they are ID or low-information variables
No GPU active on this device
Running XGBoost using CPU parameters
Removing 120 columns from further processing since ID or low information variables
columns removed: [60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 166, 167, 168, 169, 170, 171, 173, 174, 178, 179, 180, 183, 184, 188, 198, 206]
After removing redundant variables from further processing, features left = 100
#### Single_Label Regression Feature Selection Started ####
Searching for highly correlated variables from 100 variables using SULOV method
##### SULOV : Searching for Uncorrelated List Of Variables (takes time...) ############
SULOV method is erroring. Continuing ...
Adding 0 categorical variables to reduced numeric variables of 100
############## F E A T U R E S E L E C T I O N ####################
Converted pandas dataframe into a Dask dataframe ...
Train and Test loaded into Dask dataframes successfully after feature_engg completed
Current number of predictors = 100
Dask version = 2021.12.0
Using Dask XGBoost algorithm with 24 virtual CPUs and 2GB memory limit...
Dask client configuration: <Client: 'tcp://127.0.0.1:35681' processes=24 threads=24, memory=44.70 GiB>
XGBoost version: 1.5.1
2022-01-03 15:47:07,594 INFO start listen on 10.111.78.92:9091
Num of booster rounds = 100
using 100 variables...
[15:47:07] task [xgboost.dask]:tcp://127.0.0.1:32821 got new rank 0
2022-01-03 15:47:07,891 INFO @tracker All of 1 nodes getting started
2022-01-03 15:47:08,323 INFO @tracker All nodes finishes job
2022-01-03 15:47:08,324 INFO @tracker 0.43133997917175293 secs between node start and job finish
2022-01-03 15:47:08,469 INFO start listen on 10.111.78.92:9091
[15:47:08] task [xgboost.dask]:tcp://127.0.0.1:32821 got new rank 0
2022-01-03 15:47:08,478 INFO @tracker All of 1 nodes getting started
Time taken for training: 1 seconds
using 80 variables...
2022-01-03 15:47:08,782 INFO @tracker All nodes finishes job
2022-01-03 15:47:08,782 INFO @tracker 0.30301523208618164 secs between node start and job finish
2022-01-03 15:47:08,986 INFO start listen on 10.111.78.92:9091
[15:47:08] task [xgboost.dask]:tcp://127.0.0.1:32821 got new rank 0
2022-01-03 15:47:09,001 INFO @tracker All of 1 nodes getting started
Time taken for training: 0 seconds
using 60 variables...
2022-01-03 15:47:09,257 INFO @tracker All nodes finishes job
2022-01-03 15:47:09,258 INFO @tracker 0.25629758834838867 secs between node start and job finish
2022-01-03 15:47:09,361 INFO start listen on 10.111.78.92:9091
[15:47:09] task [xgboost.dask]:tcp://127.0.0.1:32821 got new rank 0
2022-01-03 15:47:09,372 INFO @tracker All of 1 nodes getting started
Time taken for training: 0 seconds
using 40 variables...
2022-01-03 15:47:09,560 INFO @tracker All nodes finishes job
2022-01-03 15:47:09,560 INFO @tracker 0.18846368789672852 secs between node start and job finish
2022-01-03 15:47:09,638 INFO start listen on 10.111.78.92:9091
[15:47:09] task [xgboost.dask]:tcp://127.0.0.1:32821 got new rank 0
2022-01-03 15:47:09,650 INFO @tracker All of 1 nodes getting started
Time taken for training: 0 seconds
using 20 variables...
2022-01-03 15:47:09,809 INFO @tracker All nodes finishes job
2022-01-03 15:47:09,809 INFO @tracker 0.15920519828796387 secs between node start and job finish
Time taken for training: 0 seconds
Dask XGBoost is crashing. Returning with currently selected features...
And there are 4 empty graphs below.
(Nevertheless, thank you very much for trying to fix this.)
featurewiz.__version__ shows '0.0.61'
Hi @mkoz92 👍
There is a new version, 0.0.62; try upgrading to that. Also try changing dask_xgboost_flag to False. Try both True and False and see which works.
Without the data, it is hard to tell what's going on, but try it again.
AutoViML
Hey, I've included an example df that fails; I saved it using standard pandas.DataFrame.to_pickle. Please tell me if I can help somehow. The y column is the dependent variable.
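For anyone reproducing this, the pickle round-trip looks roughly like the following minimal sketch (the file name and toy values here are hypothetical, not the actual attachment):

```python
import os
import tempfile

import pandas as pd

# Toy frame; 'y' is the dependent variable, as described in the comment above
df = pd.DataFrame({"x": [1, 2, 3], "y": [0, 1, 0]})

# Save with the standard pandas pickle writer, then reload it
path = os.path.join(tempfile.mkdtemp(), "example_df.pkl")  # hypothetical path
df.to_pickle(path)
restored = pd.read_pickle(path)
```

Loading the attachment back with pandas.read_pickle should reproduce the frame exactly.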
Hi @mkoz92 👍
Michael
Please note that you might be running an old version. It runs perfectly fine in the new version 0.0.71; please upgrade via:
pip install featurewiz --upgrade
Featurewiz_Test_Pickle_File.zip
The only mistake you made is that you didn't change the column names from integers to strings.
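For reference, a minimal sketch of that fix in pandas (the toy frame and values here are illustrative, not taken from the attached file):

```python
import pandas as pd

# A toy frame with integer column names, like the pickled example in this thread
df = pd.DataFrame({0: [1.0, 2.0], 1: [3.0, 4.0], "y": [0, 1]})

# Cast every column name to a string before passing the frame to featurewiz
df.columns = df.columns.astype(str)
```

After the cast, all column labels are strings ('0', '1', 'y') and the frame can be fed to featurewiz as-is.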
I have attached my notebook to show how it works.
Please confirm and close the issue.
AutoViML Team
Hi, indeed, the file I sent does work, but it was actually just 10% of the total series. I am able to run up to 90% of the series, but when I try to run the full dataset or 95% of it, it does not work. Using 90% of the data takes 103 according to the script, but 95% stops at the output below (and FYI, I definitely have more than enough RAM). I also tested running the first 90% and the last 90%, and both work, but neither the front nor the back 95% does.
2022-01-07 11:25:15,121 INFO start listen on 10.111.116.219:9091
[11:25:15] task [xgboost.dask]:tcp://127.0.0.1:32839 got new rank 0
2022-01-07 11:25:15,401 INFO @tracker All of 1 nodes getting started
Is there some size/length limit?
Hi Michal @mkoz92 👍
There is no limit on dataset size or length in featurewiz. However, the problem might be in dask itself. You might want to try setting dask_xgboost_flag=False and running featurewiz that way; it runs just as fast without dask.
You might want to split the data 90-10: use 90% for training and set aside the 10% for testing. That way you can see whether the features featurewiz selected work well on unseen (test) data. This is another idea.
You should not worry too much about not training on 100% of the data. It is better to randomly sample 90% of your data for training, get the best features selected, and then test them on the remaining 10%. Just my 2 cents.
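That 90/10 random split can be sketched in plain pandas (the DataFrame here is a toy stand-in; 'y' is the target column as elsewhere in this thread):

```python
import numpy as np
import pandas as pd

# Toy data standing in for the user's dataset, with 'y' as the target column
df = pd.DataFrame(np.random.rand(1000, 3), columns=["a", "b", "y"])

# Randomly sample 90% for training; the remaining rows become the test set
train = df.sample(frac=0.9, random_state=42)
test = df.drop(train.index)
```

Feature selection would then run on `train` only, with `test` held out to check how the selected features generalize.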
AutoViML
Hi,
indeed, when I switched dask_xgboost_flag it did run on the whole dataset. However, I then tested on the 90% both ways (True and False) and got different results; is that normal? Furthermore, the final model will have about 100x more data than even the 100% of what I was using now, and of course I will split the data into proper train/test/validation pieces. Thank you for your help! :)
Hi @mkoz92 👍
Michal:
Indeed, the results will be different when you switch dask_xgboost_flag from True to False and vice versa, since the order of training samples fed to the model changes dramatically. For example, with dask the data is fed in parallel, while without dask the entire dataset is fed in sequence. So of course the results will differ, but I have tried to make the differences as small as possible through very careful calibration.
In the future, when you hit a roadblock where you can't feed the whole dataset, try sampling the data. It's always good practice.