
Comments (9)

AutoViML commented on July 28, 2024

Hi @mkoz92 👍
This has been fixed. Please upgrade via:

pip install featurewiz --upgrade

It should work now. Please confirm and close.
Thanks
AutoViML


mkoz92 commented on July 28, 2024

Hey AutoViML, after updating, the problem still persists, but it now behaves a bit differently:

Skipping feature engineering since no feature_engg input...
Skipping category encoding since no category encoders specified in input...
Loading train data...
Sampling 1000 rows from dataframe given
    Since dask_xgboost_flag is True, reducing memory size and loading into dask
    Caution: We will try to reduce the memory usage of dataframe from 1.69 MB
    Memory usage after optimization is: 0.43 MB
        decreased by 74.7%
    Converted pandas dataframe into a Dask dataframe ...
Loading test data...
    Since dask_xgboost_flag is True, reducing memory size and loading into dask
    No file given. Continuing...
Classifying features using 1000 rows...
    loading a random sample of 1000 rows into pandas for EDA
############## C L A S S I F Y I N G  V A R I A B L E S  ####################
Classifying variables in data set...
    220 Predictors classified...
        120 variable(s) will be ignored since they are ID or low-information variables
No GPU active on this device
    Running XGBoost using CPU parameters
Removing 120 columns from further processing since ID or low information variables
    columns removed: [60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 166, 167, 168, 169, 170, 171, 173, 174, 178, 179, 180, 183, 184, 188, 198, 206]
    After removing redundant variables from further processing, features left = 100
#### Single_Label Regression Feature Selection Started ####
Searching for highly correlated variables from 100 variables using SULOV method
#####  SULOV : Searching for Uncorrelated List Of Variables (takes time...) ############
    SULOV method is erroring. Continuing ...
    Adding 0 categorical variables to reduced numeric variables  of 100
############## F E A T U R E   S E L E C T I O N  ####################
    Converted pandas dataframe into a Dask dataframe ...
Train and Test loaded into Dask dataframes successfully after feature_engg completed
Current number of predictors = 100 
Dask version = 2021.12.0
    Using Dask XGBoost algorithm with 24 virtual CPUs and 2GB memory limit...
Dask client configuration: <Client: 'tcp://127.0.0.1:35681' processes=24 threads=24, memory=44.70 GiB>
XGBoost version: 1.5.1
2022-01-03 15:47:07,594 INFO     start listen on 10.111.78.92:9091
Num of booster rounds = 100
        using 100 variables...
[15:47:07] task [xgboost.dask]:tcp://127.0.0.1:32821 got new rank 0
2022-01-03 15:47:07,891 INFO     @tracker All of 1 nodes getting started
2022-01-03 15:47:08,323 INFO     @tracker All nodes finishes job
2022-01-03 15:47:08,324 INFO     @tracker 0.43133997917175293 secs between node start and job finish
2022-01-03 15:47:08,469 INFO     start listen on 10.111.78.92:9091
[15:47:08] task [xgboost.dask]:tcp://127.0.0.1:32821 got new rank 0
2022-01-03 15:47:08,478 INFO     @tracker All of 1 nodes getting started
            Time taken for training: 1 seconds
        using 80 variables...
2022-01-03 15:47:08,782 INFO     @tracker All nodes finishes job
2022-01-03 15:47:08,782 INFO     @tracker 0.30301523208618164 secs between node start and job finish
2022-01-03 15:47:08,986 INFO     start listen on 10.111.78.92:9091
[15:47:08] task [xgboost.dask]:tcp://127.0.0.1:32821 got new rank 0
2022-01-03 15:47:09,001 INFO     @tracker All of 1 nodes getting started
            Time taken for training: 0 seconds
        using 60 variables...
2022-01-03 15:47:09,257 INFO     @tracker All nodes finishes job
2022-01-03 15:47:09,258 INFO     @tracker 0.25629758834838867 secs between node start and job finish
2022-01-03 15:47:09,361 INFO     start listen on 10.111.78.92:9091
[15:47:09] task [xgboost.dask]:tcp://127.0.0.1:32821 got new rank 0
2022-01-03 15:47:09,372 INFO     @tracker All of 1 nodes getting started
            Time taken for training: 0 seconds
        using 40 variables...
2022-01-03 15:47:09,560 INFO     @tracker All nodes finishes job
2022-01-03 15:47:09,560 INFO     @tracker 0.18846368789672852 secs between node start and job finish
2022-01-03 15:47:09,638 INFO     start listen on 10.111.78.92:9091
[15:47:09] task [xgboost.dask]:tcp://127.0.0.1:32821 got new rank 0
2022-01-03 15:47:09,650 INFO     @tracker All of 1 nodes getting started
            Time taken for training: 0 seconds
        using 20 variables...
2022-01-03 15:47:09,809 INFO     @tracker All nodes finishes job
2022-01-03 15:47:09,809 INFO     @tracker 0.15920519828796387 secs between node start and job finish
            Time taken for training: 0 seconds
Dask XGBoost is crashing. Returning with currently selected features...

And there are 4 empty graphs below.

(nevertheless, thank you very much for even trying to fix that)

featurewiz.__version__ shows '0.0.61'


AutoViML commented on July 28, 2024

Hi @mkoz92 👍
There is a new version, 0.0.62; try upgrading to that. Also try changing dask_xgboost_flag to False. Try both True and False and see if either works.
Without the data it is hard to tell what's going on, but try it again.
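For example (a sketch; the keyword names below follow the featurewiz README and may differ slightly across versions, so check the docstring of your installed release):

    from featurewiz import featurewiz

    # returns the selected feature names plus transformed data
    # (the exact return shape varies by version)
    outputs = featurewiz(
        dataname=train_df,        # pandas DataFrame or path to a file
        target='y',               # name of the dependent-variable column
        dask_xgboost_flag=False,  # set True to retry the Dask XGBoost path
    )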
AutoViML


mkoz92 commented on July 28, 2024

Hey, I've attached an example DataFrame that fails; I saved it with the standard pandas.DataFrame.to_pickle. Please tell me if I can help somehow. The y column is the dependent variable.

example_pickle_df.pkl.zip
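(It round-trips with stock pandas; the filename below just mirrors the attachment:)

    import pandas as pd

    df.to_pickle('example_pickle_df.pkl')         # how the attachment was saved
    df = pd.read_pickle('example_pickle_df.pkl')  # to load it back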


AutoViML commented on July 28, 2024

Hi @mkoz92 👍
Michael
Please note that you might be running an old version. It runs perfectly fine in the new version 0.0.71; please upgrade via:
pip install featurewiz --upgrade
Featurewiz_Test_Pickle_File.zip

The only mistake you made is that you didn't change the column names from integers to strings.
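For example, a one-line fix before calling featurewiz (df stands for the pickled frame):

    # featurewiz expects string column names;
    # cast the integer labels to strings first
    df.columns = df.columns.astype(str)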
I have attached my notebook to show how it works.
Please confirm and close the issue.
AutoViML Team


mkoz92 commented on July 28, 2024

Hi, indeed, the file I sent does work, but it was actually just 10% of the total series. I am able to run up to 90% of the series, but when I try the full dataset, or even 95% of it, it does not work. Using 90% of the data takes 103 according to the script, but 95% stops at the output below (and FYI, I definitely have more than enough RAM). I also tested running the first 90% and the last 90%, and both work, but neither the first nor the last 95% does.

2022-01-07 11:25:15,121 INFO     start listen on 10.111.116.219:9091
[11:25:15] task [xgboost.dask]:tcp://127.0.0.1:32839 got new rank 0
2022-01-07 11:25:15,401 INFO     @tracker All of 1 nodes getting started

Is there some size/length limit?


AutoViML commented on July 28, 2024

Hi Michal @mkoz92 👍
There is no limit on dataset size or length in featurewiz. However, the problem might be in dask itself. You might want to try setting dask_xgboost_flag=False and running featurewiz that way; it is just as fast.

You might also want to split the data 90-10: use 90% for training and set aside 10% for testing. That way you can see whether the featurewiz-selected features work well on unseen (test) data.

You should not worry too much about not training on 100% of the data. It is better to randomly sample 90% of your data to select the best features and then test them on the remaining 10%. Just my 2 cents.
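One way to do that split (a sketch using scikit-learn; df stands for the full frame):

    from sklearn.model_selection import train_test_split

    # hold out 10% to check the selected features on unseen data
    train_df, test_df = train_test_split(df, test_size=0.10, random_state=42)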

AutoViML


mkoz92 commented on July 28, 2024

Hi,
indeed, when I switched dask_xgboost_flag it did run on the whole dataset. However, I then tested on the 90% both ways (True and False) and got different results; is that normal? Furthermore, the final model will have about 100x more data than even the 100% I was using now, and of course I will split the data into proper train/test/validation pieces. But thank you for your help! :)


AutoViML commented on July 28, 2024

Hi @mkoz92 👍
Michal:

Indeed, the results will be different when you switch dask_xgboost_flag from True to False and vice versa, since the order of the training samples fed to the model changes dramatically. For example, with dask the data is fed in parallel, while without dask the entire dataset is fed in sequence. So of course the results will differ, but I have tried to make the differences as small as possible through careful calibration.

In the future, when you hit a roadblock where you can't feed the whole dataset, try sampling the data. It's always good practice.
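For instance, a reproducible 90% random sample with pandas (df stands for the full frame):

    sample_df = df.sample(frac=0.90, random_state=42)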

