cod3licious / autofeat

Linear Prediction Model with Automated Feature Engineering and Selection Capabilities

License: MIT License

Python 69.27% Jupyter Notebook 30.73%
machine-learning machine-learning-models linear-regression feature-engineering feature-selection automl automated-machine-learning automated-feature-engineering automated-data-science

autofeat's Introduction

autofeat library

Linear Prediction Models with Automated Feature Engineering and Selection

This library contains the AutoFeatRegressor and AutoFeatClassifier models, which have an interface similar to that of scikit-learn models:

  • fit() function to fit the model parameters
  • predict() function to predict the target variable given the input
  • predict_proba() function to predict probabilities of the target variable given the input (classifier only)
  • score() function to calculate the goodness of the fit (R^2/accuracy)
  • fit_transform() and transform() functions, which extend the given data by the additional features that were engineered and selected by the model

When calling the fit() function, the fit_transform() function is called internally, so if you're planning to call transform() on the same data anyway, just call fit_transform() right away. transform() is mostly useful if you've split your data into training and test data and did not call fit_transform() on your whole dataset. The predict() and score() functions can either be given data in the format of the original dataframe that was used when calling fit()/fit_transform(), or they can be given an already transformed dataframe.
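For illustration, a minimal usage sketch (the toy data and column names are made up; the calls are the interface described above):

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from autofeat import AutoFeatRegressor

# toy data with hypothetical column names
df = pd.DataFrame({"x1": np.random.rand(100), "x2": np.random.rand(100)})
y = df["x1"] ** 2 + df["x2"]
X_train, X_test, y_train, y_test = train_test_split(df, y, random_state=42)

model = AutoFeatRegressor(feateng_steps=2)
# fit() would call fit_transform() internally, so call fit_transform() directly
X_train_tr = model.fit_transform(X_train, y_train)
# generate only the selected features for the held-out data
X_test_tr = model.transform(X_test)
# score() also accepts data in the original dataframe format
print(model.score(X_test, y_test))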

In addition, the feature selection part on its own is available in the FeatureSelector model.

Furthermore (as of version 2.0.0), minimal feature selection (removing zero-variance and redundant features), feature engineering (simple products and ratios of features), and scaling (a power transform to make features more normally distributed) are also available in the AutoFeatLight model.
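A short sketch of these two entry points (FeatureSelector's arguments follow an issue report further below; AutoFeatLight is shown with defaults since its constructor arguments aren't listed here):

from autofeat import FeatureSelector, AutoFeatLight

# standalone feature selection on an existing feature matrix
fsel = FeatureSelector(problem_type="regression")
X_selected = fsel.fit_transform(X, y)

# lightweight cleanup: zero-variance/redundant removal, products/ratios, scaling
afl = AutoFeatLight()
X_light = afl.fit_transform(X)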

The AutoFeatRegressor, AutoFeatClassifier, and FeatureSelector models need to be fit on data without NaNs, as they internally call the sklearn LassoLarsCV model, which cannot handle NaNs. When calling transform(), NaNs (but not np.inf) are okay.

The autofeat examples notebook contains a simple usage example - try it out! :) Additional examples can be found in the autofeat benchmark notebooks for regression (which also contains the code to reproduce the results from the paper mentioned below) and classification, as well as the testing scripts.

Please keep in mind that since the AutoFeatRegressor and AutoFeatClassifier models can generate very complex features, they might overfit on noise in the dataset, i.e., find some new features that lead to good prediction on the training set but result in poor performance on new test samples. While this usually only happens for datasets with very few samples, we suggest you carefully inspect the features found by autofeat and use those that make sense to you to train your own models.

Depending on the number of feateng_steps (default 2) and the number of input features, autofeat can generate a huge feature matrix (before selecting the most appropriate features from this large feature pool). By specifying in feateng_cols those columns that you expect to be most valuable for feature engineering, the number of generated features can be greatly reduced. Additionally, the transformations can be limited to only those that make sense for your data. Last but not least, you can subsample the data used for training the model to limit the memory requirements. After the model has been fit, you can call transform() on your whole dataset to generate only those few features that were selected during fit()/fit_transform().
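As a sketch of these options (the column names are placeholders, and the transformation names shown are assumptions about the accepted values):

model = AutoFeatRegressor(
    feateng_steps=2,                  # depth of the feature engineering process
    feateng_cols=["x1", "x2"],        # only engineer features from these columns
    transformations=("log", "sqrt"),  # restrict the applied transformations
)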

Installation

You can either download the code from here and include the autofeat folder in your $PYTHONPATH or install (the library components only) via pip:

$ pip install autofeat

The library requires Python 3! Other dependencies: numpy, pandas, scikit-learn, sympy, joblib, pint and numba.

Paper

For further details on the model and implementation please refer to the paper - and of course if any of this code was helpful for your research, please consider citing it:

@inproceedings{horn2019autofeat,
  title={The autofeat Python Library for Automated Feature Engineering and Selection},
  author={Horn, Franziska and Pack, Robert and Rieger, Michael},
  booktitle={Joint European Conference on Machine Learning and Knowledge Discovery in Databases},
  pages={111--120},
  year={2019},
  organization={Springer}
}

If you don't like reading, you can also watch a video of my talk at the PyData conference about automated feature engineering and selection with autofeat.

The code is intended for research purposes.

If you have any questions, please don't hesitate to send me an email. And of course, if you find any bugs or want to contribute other improvements, pull requests are very welcome!

Acknowledgments

This project was made possible thanks to support by BASF.

autofeat's People

Contributors

cod3licious, jeethu, mglowacki100, stephanos-stephani


autofeat's Issues

In toydata, autofeat finds the correct function (square) only under some circumstances

Hi,
Not sure if this is a bug, but I thought the behavior was quite strange. I created some toy data: a set of independent variables x_i in [0, 1), and as target I compute y = x_0^2. The other x_i are noise variables.
Autofeat doesn't find the correct functional form in this case; instead it finds something like a*x_0 + b*(1/x_0) for suitable values of a and b.
However, when I increase the range of x_0 from [0, 1) to [-1, 1), autofeat easily finds the correct functional form.

Here's my code for data generation:

import numpy as np

def generate_toydata(n_samples, fun, x_min, x_max, noise_level=0.1):
    x_range = x_max - x_min
    # main feature
    x_main = (np.random.rand(n_samples) * x_range) + x_min
    # generate 4 unrelated noise features
    x_unrelated = np.random.rand(n_samples, 4)
    y = fun(x_main) + np.random.normal(0, noise_level, n_samples)
    X = np.column_stack((x_main, x_unrelated))
    return X, y


from sklearn.model_selection import train_test_split

X, y = generate_toydata(
    n_samples=100,
    fun=np.square,
    x_min=0.0,
    x_max=1.0,
    noise_level=0.01,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False,
)
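To complete the reproduction, fitting would look something like this (the feateng_steps setting is an assumption; the issue doesn't state the parameters used):

from autofeat import AutoFeatRegressor

model = AutoFeatRegressor(feateng_steps=2)
# fit on the training split and inspect which features were engineered
X_train_tr = model.fit_transform(X_train, y_train)
print(X_train_tr.columns)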

pandas corr is too slow; use numpy instead

When running on a medium dataset (~50K instances, ~35 features), I noticed that autofeat's AutoFeatLight was taking hours and hours to complete. After debugging a bit, the bottleneck was the line:

corrmat = df.corr().abs()

When I replaced it with numpy, the code ran orders of magnitude faster:

corrmat = pd.DataFrame(np.abs(np.corrcoef(df.values, rowvar=False)), columns=df.columns)

error: Unable to find vcvarsall.bat

I got this error on windows 10:

df = model.fit_transform(X, y)
[AutoFeatRegression] The 3 step feature engineering process could generate up to 2864745 features.
[AutoFeatRegression] With 48573 data points this new feature matrix would use about 556.60 gb of space.
Step 1: transformation of original features
              0/             83
Traceback (most recent call last):
  File "D:\Anaconda3\envs\quant\lib\site-packages\sympy\utilities\autowrap.py", line 168, in _process_files
    retoutput = check_output(command, stderr=STDOUT)
  File "D:\Anaconda3\envs\quant\lib\subprocess.py", line 356, in check_output
    **kwargs).stdout
  File "D:\Anaconda3\envs\quant\lib\subprocess.py", line 438, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['D:\\Anaconda3\\envs\\quant\\python.exe', 'setup.py', 'build_ext', '--inplace']' returned non-zero exit status 1.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "D:\Anaconda3\envs\quant\lib\site-packages\IPython\core\interactiveshell.py", line 3291, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-19-818b183e7273>", line 1, in <module>
    df = model.fit_transform(X, y)
  File "D:\Anaconda3\envs\quant\lib\site-packages\autofeat\autofeat.py", line 249, in fit_transform
    self.feateng_steps, self.transformations)
  File "D:\Anaconda3\envs\quant\lib\site-packages\autofeat\feateng.py", line 249, in generate_features
    original_features.extend(apply_tranformations(original_features))
  File "D:\Anaconda3\envs\quant\lib\site-packages\autofeat\feateng.py", line 194, in apply_tranformations
    f = ufuncify(t, expr_temp)
  File "D:\Anaconda3\envs\quant\lib\site-packages\sympy\core\cache.py", line 94, in wrapper
    retval = cfunc(*args, **kwargs)
  File "D:\Anaconda3\envs\quant\lib\site-packages\sympy\utilities\autowrap.py", line 1105, in ufuncify
    return code_wrapper.wrap_code(routines, helpers=helps)
  File "D:\Anaconda3\envs\quant\lib\site-packages\sympy\utilities\autowrap.py", line 828, in wrap_code
    self._process_files(routines)
  File "D:\Anaconda3\envs\quant\lib\site-packages\sympy\utilities\autowrap.py", line 172, in _process_files
    " ".join(command), e.output.decode('utf-8')))
sympy.utilities.autowrap.CodeWrapError: Error while executing command: D:\Anaconda3\envs\quant\python.exe setup.py build_ext --inplace. Command output is:
running build_ext
running build_src
build_src
building extension "wrapper_module_1" sources
build_src: building npy-pkg config files
No module named 'numpy.distutils._msvccompiler' in numpy.distutils; trying from distutils
customize MSVCCompiler
customize MSVCCompiler using build_ext
building 'wrapper_module_1' extension
compiling C sources
error: Unable to find vcvarsall.bat

[enhancement] add predict_proba for classifiers

Autofeat provides only predict for classification problems; it should also provide predict_proba for use cases where we're interested not in a 'hard prediction' but in a 'ranking'. I think it shouldn't be hard to expose it from the underlying sklearn models.

least_angle.py in lars_path: ValueError: operands could not be broadcast together with shapes

Known issue related to sklearn bug: scikit-learn/scikit-learn#9603
Sometimes (randomly) during the feature selection process, the LassoLARS model crashes with an error like this:

.../lib/python3.6/site-packages/sklearn/linear_model/least_angle.py in lars_path(X, y, Xy, Gram, max_iter, alpha_min, method, copy_X, eps, copy_Gram, verbose, return_path, return_n_iter, positive)
    378                                  least_squares)
    379 
--> 380         g1 = arrayfuncs.min_pos((C - Cov) / (AA - corr_eq_dir + tiny32))
    381         if positive:
    382             gamma_ = min(g1, C / AA)

ValueError: operands could not be broadcast together with shapes (397,) (396,)

Try updating your sklearn and numpy versions; otherwise I'm afraid there is not much that could be done here.

Documentation Enhancement for getting model, features, coefficients


name: Documentation Enhancement
about: Request for clarification or improvement in the documentation


Description of the Issue

I've been using the autofeat library and encountered difficulties in obtaining the final model, model features, and coefficients as described in the library's documentation.

  • The model.get_model(), model.get_model_features(), and model.get_model_coefs() methods do not seem to work as expected.
  • I would appreciate clarification on how to access the final model, its features, and coefficients or an update to the documentation.

Additional Information

  • Python Version: [e.g., Python 3.9]
  • autofeat Library Version: [e.g., autofeat 2.1.2]

Note to Maintainers

Please consider updating the documentation or providing guidance on how to access the final model, features, and coefficients using the AutoFeatRegressor class. This would greatly help users in understanding and utilizing the library effectively.

Thank you for your attention to this matter.

ufunc '_lambdifygenerated' did not contain a loop with signature matching types (<class 'numpy.dtype[float32]'>, <class 'numpy.dtype[float32]'>) -> None

Hi, Drs,
X_train and y_train are DataFrames of float64 arrays when I use the following code:

    import scipy.io as sio
    import pandas as pd
    from autofeat import AutoFeatClassifier

    alldata1 = sio.loadmat(path)['input_train'].ravel()
    x_train = pd.DataFrame(alldata1[folds])
    x_train.columns = input_new_columns
    alldata2 = sio.loadmat(path)['output_train'].ravel()
    y_train = pd.DataFrame(alldata2[folds])
    y_train.columns = ['targat']
    alldata3 = sio.loadmat(path)['input_test'].ravel()
    x_test = pd.DataFrame(alldata3[folds])
    x_test.columns = input_new_columns
    alldata4 = sio.loadmat(path)['output_test'].ravel()
    y_test = pd.DataFrame(alldata4[folds])
    y_test.columns = ['targat']
    alldata5 = sio.loadmat(path)['input_vaildation'].ravel()
    x_validation = pd.DataFrame(alldata5[folds])
    x_validation.columns = input_new_columns
    alldata6 = sio.loadmat(path)['output_vaildation'].ravel()
    y_validation = pd.DataFrame(alldata6[folds])
    y_validation.columns = ['targat']
    afreg = AutoFeatClassifier(feateng_steps=2, categorical_cols=sparse_new_columns)
    x_train_tr = afreg.fit_transform(x_train, y_train)

but I get an error:

UFuncTypeError: ufunc '_lambdifygenerated' did not contain a loop with signature matching types (<class 'numpy.dtype[float32]'>, <class 'numpy.dtype[float32]'>) -> None

May I ask what causes this?

ValueError: Input X contains NaN.

afreg = AutoFeatRegressor(verbose=1, feateng_steps=3, n_jobs=24)
afreg.fit(x_train, y_train)

ValueError Traceback (most recent call last)
Input In [15], in <cell line: 2>()
1 afreg = AutoFeatRegressor(verbose=1, feateng_steps=1, n_jobs=24)
----> 2 afreg.fit(x_train, y_train)

File /app/conda/b2b-latest/lib/python3.8/site-packages/autofeat/autofeat.py:422, in AutoFeatModel.fit(self, X, y)
420 print("[AutoFeat] Warning: This just calls fit_transform() but does not return the transformed dataframe.")
421 print("[AutoFeat] It is much more efficient to call fit_transform() instead of fit() and transform()!")
--> 422 _ = self.fit_transform(X, y) # noqa
423 return self

File /app/conda/b2b-latest/lib/python3.8/site-packages/autofeat/autofeat.py:353, in AutoFeatModel.fit_transform(self, X, y)
351 else:
352 if self.problem_type in ("regression", "classification"):
--> 353 good_cols = select_features(
354 df_subs, target_sub, self.featsel_runs, None, self.problem_type, self.n_jobs, self.verbose
355 )
356 # if no features were selected, take the original features
357 if not good_cols:

File /app/conda/b2b-latest/lib/python3.8/site-packages/autofeat/featsel.py:245, in select_features(df, target, featsel_runs, keep, problem_type, n_jobs, verbose)
241 def flatten_lists(l: list):
242 return [item for sublist in l for item in sublist]
244 selected_columns = flatten_lists(
--> 245 Parallel(n_jobs=n_jobs, verbose=100 * verbose)(delayed(run_select_features)(i) for i in range(featsel_runs))
246 )
248 if selected_columns:
249 selected_columns = Counter(selected_columns)

File /app/conda/b2b-latest/lib/python3.8/site-packages/joblib/parallel.py:1944, in Parallel.__call__(self, iterable)
1938 # The first item from the output is blank, but it makes the interpreter
1939 # progress until it enters the Try/Except block of the generator and
1940 # reach the first yield statement. This starts the aynchronous
1941 # dispatch of the tasks to the workers.
1942 next(output)
-> 1944 return output if self.return_generator else list(output)

File /app/conda/b2b-latest/lib/python3.8/site-packages/joblib/parallel.py:1587, in Parallel._get_outputs(self, iterator, pre_dispatch)
1584 yield
1586 with self._backend.retrieval_context():
-> 1587 yield from self._retrieve()
1589 except GeneratorExit:
1590 # The generator has been garbage collected before being fully
1591 # consumed. This aborts the remaining tasks if possible and warn
1592 # the user if necessary.
1593 self._exception = True

File /app/conda/b2b-latest/lib/python3.8/site-packages/joblib/parallel.py:1691, in Parallel._retrieve(self)
1684 while self._wait_retrieval():
1685
1686 # If the callback thread of a worker has signaled that its task
1687 # triggered an exception, or if the retrieval loop has raised an
1688 # exception (e.g. GeneratorExit), exit the loop and surface the
1689 # worker traceback.
1690 if self._aborting:
-> 1691 self._raise_error_fast()
1692 break
1694 # If the next job is not ready for retrieval yet, we just wait for
1695 # async callbacks to progress.

File /app/conda/b2b-latest/lib/python3.8/site-packages/joblib/parallel.py:1726, in Parallel._raise_error_fast(self)
1722 # If this error job exists, immediatly raise the error by
1723 # calling get_result. This job might not exists if abort has been
1724 # called directly or if the generator is gc'ed.
1725 if error_job is not None:
-> 1726 error_job.get_result(self.timeout)

File /app/conda/b2b-latest/lib/python3.8/site-packages/joblib/parallel.py:735, in BatchCompletionCallBack.get_result(self, timeout)
729 backend = self.parallel._backend
731 if backend.supports_retrieve_callback:
732 # We assume that the result has already been retrieved by the
733 # callback thread, and is stored internally. It's just waiting to
734 # be returned.
--> 735 return self._return_or_raise()
737 # For other backends, the main thread needs to run the retrieval step.
738 try:

File /app/conda/b2b-latest/lib/python3.8/site-packages/joblib/parallel.py:753, in BatchCompletionCallBack._return_or_raise(self)
751 try:
752 if self.status == TASK_ERROR:
--> 753 raise self._result
754 return self._result
755 finally:

ValueError: Input X contains NaN.
LassoLarsCV does not accept missing values encoded as NaN natively.

Can you share a simple example to see this mentioned overfit?

Quoting the README: "Please keep in mind that since the AutoFeatRegressor and AutoFeatClassifier models can generate very complex features, it will likely overfit on noise in the dataset, though the coefficients for features related to noise should be fairly small. It is generally suggested to carefully inspect the features found by autofeat and use those that make sense to you to train your own models."

TypeError: unsupported operand type(s) for |: 'type' and 'NoneType'

Getting this error while using the benchmark of autofeat.AutoFeatClassifier

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[1], line 8
      6 import numpy as np
      7 import pandas as pd
----> 8 from autofeat import AutoFeatClassifier
      9 from sklearn.datasets import load_breast_cancer, load_iris, load_wine
     10 from sklearn.ensemble import RandomForestClassifier

File ~/my-conda-envs/fraud/lib/python3.8/site-packages/autofeat/__init__.py:6
      4 name = "autofeat"
      5 __version__ = "2.1.0"
----> 6 from .autofeatlight import AutoFeatLight  # noqa
      7 from .autofeat import AutoFeatModel, AutoFeatRegressor, AutoFeatClassifier  # noqa
      8 from .featsel import FeatureSelector  # noqa

File ~/my-conda-envs/fraud/lib/python3.8/site-packages/autofeat/autofeatlight.py:56
     51         print(sorted(useless_cols))
     52     return [c for c in df.columns if c not in useless_cols]
     55 def _compute_additional_features(
---> 56     X: np.ndarray, feature_names: list | None = None, compute_ratio: bool = True, compute_product: bool = True, verbose: int = 0
     57 ) -> Tuple[np.ndarray, list]:
     58     """
     59     Compute additional non-linear features from the original features (ratio or product of two features).
     60 
   (...)
     69         - list with n_additional_features names describing the newly computed features
     70     """
     71     # check how many new features we will compute

TypeError: unsupported operand type(s) for |: 'type' and 'NoneType'
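For context, a likely cause (an observation, not an official fix): the list | None annotation shown in the traceback is PEP 604 union syntax, which is evaluated at function-definition time here and requires Python 3.10+, while the traceback shows a Python 3.8 environment. A sketch of the two standard workarounds:

# option 1: defer annotation evaluation, so the | union is never executed
# (must be the first import in the module)
from __future__ import annotations

# option 2: spell the union with typing.Optional, which works on older Pythons
from typing import Optional

def _compute_additional_features(X, feature_names: Optional[list] = None):
    ...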

Cython implementation

I was thinking of using cython to run autofeat in C and changing some of the variables to static. Do you think this would make autofeat significantly faster?

Correlation matrix can have inconsistent column and row names

#28 introduced a nice speed improvement by using numpy instead of pandas for correlation.

There seems to be a small bug on this line though:
https://github.com/cod3licious/autofeat/blob/master/autofeat/autofeatlight.py#L35

By not specifying an index, the correlation matrix might have feature names as column headers but numeric row indices. This in turn throws off the elimination of correlated columns.

Adding index=df.columns recovers the old behavior:

import numpy as np
import pandas as pd
from pandas.testing import assert_frame_equal

corrmat_old = df.corr().abs()
corrmat = pd.DataFrame(
    np.abs(np.corrcoef(df.values, rowvar=False)), columns=df.columns, index=df.columns
)

assert_frame_equal(corrmat_old, corrmat)

Scaling and Autofeat

Thank you for the awesome library.

Just have a question about feature scaling, I've read that features need to be scaled BEFORE using autofeat (https://analyticsindiamag.com/guide-to-automatic-feature-engineering-using-autofeat/).

Is this true? I can't seem to find information in the repository that supports this statement.

When running autofeat, it seems to perform scaling during feature selection, so I am just wondering if I have to do scaling beforehand or if I can just throw my data into autofeat and it is taken care of.

Parallelization

Great package!

Have you looked into using joblib on this loop,
for i, (feat1, feat2) in enumerate(feature_tuples)
in feateng.py?

Seems like it could be easily parallelized. You have several global variables inside the loop, so I haven't had a chance to deconstruct this into a function that could be called via joblib. I thought it might be more straightforward for you as you are more familiar with the code.
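A generic sketch of the suggestion (process_pair is a hypothetical stand-in for the loop body; as noted above, the shared state inside the real loop would first need to be untangled):

from joblib import Parallel, delayed

def process_pair(feat1, feat2):
    # hypothetical: compute the new combination feature for one tuple
    ...

results = Parallel(n_jobs=-1)(
    delayed(process_pair)(feat1, feat2) for feat1, feat2 in feature_tuples
)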

Input contains NaN, infinity or a value too large for dtype('float32') on fit_transform

Facing the following issue when running AutoFeatRegressor.fit_transform(featuresDf, targetFeature):

I already checked that there are no infinity or NaN values, and converted everything to float32. Any pointers? Thanks!

Update: tried the same set of inputs with FeatureSelector and everything is working great.

Update 2: posted this question on StackOverflow

Exception

c:\dox\rnd\ml-pipeline-notebooks\modules\autoFeat.py in augmentFeatures(features, targetFeature, verbose)
     41     print(featuresDf.head())
     42     print(targetFeature)
---> 43     newFeatureDf = autoFeatRegressor.fit_transform(featuresDf, targetFeature)
     44 
     45     if verbose:
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\utils\validation.py in _assert_all_finite(X, allow_nan, msg_dtype)
    112         ):
    113             type_err = "infinity" if allow_nan else "NaN, infinity"
--> 114             raise ValueError(
    115                 msg_err.format(
    116                     type_err, msg_dtype if msg_dtype is not None else X.dtype

ValueError: Input contains NaN, infinity or a value too large for dtype('float32').

Logs (verbose): (screenshot omitted)

Reproducibility issue

Hello,

I noticed that results are not reproducible when using the library, i.e., the sklearn drop-in-replacement classes produce slightly different results each time.

For example, when using:

features_engineer = AutoFeatClassifier()
features_engineer.fit_transform(data_train.data, data_train.target.value)

it will calculate (or select) different features each time.

I temporarily fixed the issue above by using:

 random.seed(seed)
 np.random.seed(seed)

so that the outputs produced by AutoFeatClassifier stay constant across runs.

However, when I tried using the following:

selector = FeatureSelector(verbose=self.verbose, problem_type="classification", featsel_runs=5)
selector.fit_transform(df_indices, target)

the above-mentioned seed-setting trick didn't have the desired effect: the selected features still change between runs...

Is there an easy fix for this? Randomness must be introduced somewhere in the source.

AttributeError: module 'sympy.core.core' has no attribute 'add'

After updating your library, my public notebook Titanic - Autofeat (automatic FE), which selects features for the Kaggle competition "Titanic - Machine Learning from Disaster", stopped working.
Earlier (a year ago) this code worked successfully (autofeat-0.2.5):

model = AutoFeatRegression()
X_train_feature_creation = model.fit_transform(train.to_numpy(), target.to_numpy().flatten())
X_test_feature_creation = model.transform(test.to_numpy())

Now it outputs an error (autofeat-2.0.4):

model = AutoFeatRegressor()
X_train_feature_creation = model.fit_transform(train.to_numpy(), target.to_numpy().flatten())
X_test_feature_creation = model.transform(test.to_numpy())

I've tried this as well:

X_train_feature_creation = model.fit_transform(train, target)
X_test_feature_creation = model.transform(test)

and:

model.fit(train, target)

I got the same error.

Thanks in advance for the advice on how to fix this.

Cannot reproduce results

Every time I call fit_transform I get different results.

I noticed that np.random.permutation changes the random_state, so I used np.random.RandomState(seed=seed).permutation() to solve this.

I also noticed that np.random.seed(i) is used in run_select_features, but it changes the random state in the same way, so I can always convert back to the random_state that I had.

Even with those changes, and always getting the same random_state after calling fit_transform, I always end up with different results.

Model details

Hi, first of all, great work!!! You have just made something far better than featuretools, manual feature engineering, etc., so congratulations!

I have two questions. First, are the classifier and regressor ensemble methods, and can I see the maths behind them? Secondly, after training it shows a final score; is that a training score or cross-validation?

Theoretical question

I have a question:

Could the features created by autofeat be as effective when used in another ML algorithm or in a classification problem?

Multiclass Classification Problem

I know this is only for regression, but the approach I am thinking of is: I fit_transform the entire data frame containing multiple labels in the 'y' column and feed the obtained data frame to some other model. Will that work, or is there a better approach? (Provide a code example if possible.)

Thanks.
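Since a code example was requested, a sketch of the proposed workflow (note that AutoFeatClassifier exists in current versions, per the README above; whether its internal feature selection copes well with multiclass targets is not confirmed here, and RandomForestClassifier is just an example downstream model):

from autofeat import AutoFeatClassifier
from sklearn.ensemble import RandomForestClassifier

afc = AutoFeatClassifier(feateng_steps=2)
# engineer and select features once; y holds the multiclass labels
X_tr = afc.fit_transform(X, y)
# feed the extended dataframe to any downstream model
clf = RandomForestClassifier().fit(X_tr, y)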

Speed up transform()

Thank you for the great product!

My question is: how can I speed up the .transform() function? I use an input dataframe with only ONE row, but the time needed to transform the original features into the new ones is too long, and I don't understand why. Shouldn't it just apply the stored formulas to the new dataset with the original features?

Maybe you know how to speed up this process?

Issues with sympy module

I ran autofeat on a set of fully numeric features with no nulls

using

afreg = AutoFeatClassifier(verbose=1, feateng_steps=2, max_gb=10, n_jobs=12)
ft_train_tr = afreg.fit_transform(ft_train, target)

the verbose logging showed:

[AutoFeat] The 2 step feature engineering process could generate up to 980700 features.
[AutoFeat] With 200000 data points this new feature matrix would use about 784.56 gb of space.
[AutoFeat] As you specified a limit of 10 gb, the number of data points is subsampled to 2549
[feateng] Step 1: transformation of original features
[feateng] Generated 915 transformed features from 200 original features - done.
[feateng] Step 2: first combination of features
[feateng] Generated 620258 feature combinations from 621055 original feature tuples - done.
[feateng] Generated altogether 621174 new features in 2 steps
[feateng] Removing correlated features, as well as additions at the highest level

then I got this error:

AttributeError Traceback (most recent call last)
<ipython-input-10-5be3e98d3950> in <module>
      1 afreg = AutoFeatClassifier(verbose=1, feateng_steps=2, max_gb=10, n_jobs=12)
----> 2 ft_train_tr = afreg.fit_transform(ft_train, target)

/opt/conda/lib/python3.8/site-packages/autofeat/autofeat.py in fit_transform(self, X, y)
    298             target_sub = target.copy()
    299         # generate features
--> 300         df_subs, self.feature_formulas_ = engineer_features(df_subs, self.feateng_cols_, _parse_units(self.units, verbose=self.verbose),
    301                                                             self.feateng_steps, self.transformations, self.verbose)
    302         # select predictive features

/opt/conda/lib/python3.8/site-packages/autofeat/feateng.py in engineer_features(df_org, start_features, units, max_steps, transformations, verbose)
    343         print("[feateng] Generated altogether %i new features in %i steps" % (len(feature_pool) - len(start_features), max_steps))
    344         print("[feateng] Removing correlated features, as well as additions at the highest level")
--> 345     feature_pool = {c: feature_pool[c] for c in feature_pool if c in uncorr_features and not feature_pool[c].func == sympy.add.Add}
    346     cols = [c for c in list(df.columns) if c in feature_pool and c not in df_org.columns]  # categorical cols not in feature_pool
    347     if cols:

/opt/conda/lib/python3.8/site-packages/autofeat/feateng.py in <dictcomp>(.0)
    343         print("[feateng] Generated altogether %i new features in %i steps" % (len(feature_pool) - len(start_features), max_steps))
    344         print("[feateng] Removing correlated features, as well as additions at the highest level")
--> 345     feature_pool = {c: feature_pool[c] for c in feature_pool if c in uncorr_features and not feature_pool[c].func == sympy.add.Add}
    346     cols = [c for c in list(df.columns) if c in feature_pool and c not in df_org.columns]  # categorical cols not in feature_pool
    347     if cols:

AttributeError: module 'sympy' has no attribute 'add'

The ft_train columns are all of type float16; the target is a pandas series with values array([0, 1]).

Any ideas what is wrong?

How to transform new data?

Does this library have a method to transform a new dataset into one with the additional features generated before?

So I need to perform these steps:

  1. Generate additional features for my dataset.
  2. Save transformation matrix.
  3. Apply transformation matrix on new dataset to predict it.

Is this possible now?
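Per the README above, this is exactly what transform() is for; a minimal sketch (df_train, y_train, and df_new are placeholders):

from autofeat import AutoFeatRegressor

model = AutoFeatRegressor()
# steps 1+2: generate features and store the selected formulas in the model
df_train_tr = model.fit_transform(df_train, y_train)
# step 3: apply the stored formulas to a new dataset in the original format
df_new_tr = model.transform(df_new)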

Data validation error when using Buckingham's Pi Theorem on Classification task

Hi!
While trying to use the AutoFeatClassifier with units, I stumbled upon a validation error caused by an infinite value.
Presumably one of the generated features (I assume one of those coming from the Pi theorem) has an infinite value, which breaks the StandardScaler used during the filtering of correlated features.
This is how I am calling the classifier, fitting training data that comes as a numpy ndarray:

auto = AutoFeatClassifier(categorical_cols=categorical_cols, units=units, verbose=1, feateng_steps=3, featsel_runs=5, n_jobs=5, apply_pi_theorem=True)
X_train_new = auto.fit_transform(X_train_sampled, y_train_sampled)

These are the features logged for the Pi Theorem; all of them include divisions (which could lead to a division-by-zero issue).

...
[AutoFeat] Applying the Pi Theorem
[AutoFeat] Pi Theorem 1:  x002 / x001
[AutoFeat] Pi Theorem 2:  x006 / x000
[AutoFeat] Pi Theorem 3:  x010 / x005
[AutoFeat] Pi Theorem 4:  x003 / x001
[AutoFeat] Pi Theorem 5:  x013 / x001
[AutoFeat] Pi Theorem 6:  x014 / x005
[AutoFeat] Pi Theorem 7:  x000 * x005 * x012 / x015
[AutoFeat] Pi Theorem 8:  x016 / x000
[AutoFeat] Pi Theorem 9:  x017 / x012
...

The full log output by a failing run is the following:

[AutoFeat] Applying the Pi Theorem
[AutoFeat] Pi Theorem 1:  x002 / x001
[AutoFeat] Pi Theorem 2:  x006 / x000
[AutoFeat] Pi Theorem 3:  x007 / x005
[AutoFeat] Pi Theorem 4:  x003 / x001
[AutoFeat] Pi Theorem 5:  x009 / x001
[AutoFeat] Pi Theorem 6:  x010 / x005
[AutoFeat] Pi Theorem 7:  x000 * x005 * x008 / x011
[AutoFeat] Pi Theorem 8:  x012 / x000
[AutoFeat] Pi Theorem 9:  x013 / x008
[AutoFeat] The 3 step feature engineering process could generate up to 118923 features.
[AutoFeat] With 121 data points this new feature matrix would use about 0.06 gb of space.
[feateng] Step 1: transformation of original features
[feateng] Generated 40 transformed features from 14 original features - done.
[feateng] Step 2: first combination of features
[feateng] Generated 1524 feature combinations from 1431 original feature tuples - done.
[feateng] Step 3: transformation of new features
[feateng] Generated 4564 transformed features from 1524 original features - done.
[feateng] Generated altogether 6233 new features in 3 steps
[feateng] Removing correlated features, as well as additions at the highest level

And after that, the error is reported with the following stack trace:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-323-53dcdfc1b68e> in <module>
     32     # categorical_cols = []
     33     auto = AutoFeatClassifier(categorical_cols=categorical_cols, units=units, verbose=1, feateng_steps=3, featsel_runs=5, n_jobs=5, apply_pi_theorem=True)
---> 34     X_train_new = auto.fit_transform(X_train_sampled, y_train_sampled)
     35     X_test_new = auto.transform(X_test.to_numpy())
     36     pretty_names = feature_names(auto, USEFUL_ACTUALS)

~/.pyenv/versions/features/lib/python3.7/site-packages/autofeat/autofeat.py in fit_transform(self, X, y)
    299         # generate features
    300         df_subs, self.feature_formulas_ = engineer_features(df_subs, self.feateng_cols_, _parse_units(self.units, verbose=self.verbose),
--> 301                                                             self.feateng_steps, self.transformations, self.verbose)
    302         # select predictive features
    303         if self.featsel_runs <= 0:

~/.pyenv/versions/features/lib/python3.7/site-packages/autofeat/feateng.py in engineer_features(df_org, start_features, units, max_steps, transformations, verbose)
    354     if cols:
    355         # check for correlated features again; this time with the start features
--> 356         corrs = dict(zip(cols, np.max(np.abs(np.dot(StandardScaler().fit_transform(df[cols]).T, StandardScaler().fit_transform(df_org))/df_org.shape[0]), axis=1)))
    357         cols = [c for c in cols if corrs[c] < 0.9]
    358     cols = list(df_org.columns) + cols

~/.pyenv/versions/features/lib/python3.7/site-packages/sklearn/base.py in fit_transform(self, X, y, **fit_params)
    688         if y is None:
    689             # fit method of arity 1 (unsupervised transformation)
--> 690             return self.fit(X, **fit_params).transform(X)
    691         else:
    692             # fit method of arity 2 (supervised transformation)

~/.pyenv/versions/features/lib/python3.7/site-packages/sklearn/preprocessing/_data.py in fit(self, X, y)
    665         # Reset internal state before fitting
    666         self._reset()
--> 667         return self.partial_fit(X, y)
    668 
    669     def partial_fit(self, X, y=None):

~/.pyenv/versions/features/lib/python3.7/site-packages/sklearn/preprocessing/_data.py in partial_fit(self, X, y)
    696         X = self._validate_data(X, accept_sparse=('csr', 'csc'),
    697                                 estimator=self, dtype=FLOAT_DTYPES,
--> 698                                 force_all_finite='allow-nan')
    699 
    700         # Even in the case of `with_mean=False`, we update the mean anyway

~/.pyenv/versions/features/lib/python3.7/site-packages/sklearn/base.py in _validate_data(self, X, y, reset, validate_separately, **check_params)
    418                     f"requires y to be passed, but the target y is None."
    419                 )
--> 420             X = check_array(X, **check_params)
    421             out = X
    422         else:

~/.pyenv/versions/features/lib/python3.7/site-packages/sklearn/utils/validation.py in inner_f(*args, **kwargs)
     70                           FutureWarning)
     71         kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 72         return f(**kwargs)
     73     return inner_f
     74 

~/.pyenv/versions/features/lib/python3.7/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
    643         if force_all_finite:
    644             _assert_all_finite(array,
--> 645                                allow_nan=force_all_finite == 'allow-nan')
    646 
    647     if ensure_min_samples > 0:

~/.pyenv/versions/features/lib/python3.7/site-packages/sklearn/utils/validation.py in _assert_all_finite(X, allow_nan, msg_dtype)
     97                     msg_err.format
     98                     (type_err,
---> 99                      msg_dtype if msg_dtype is not None else X.dtype)
    100             )
    101     # for object dtype data, we only check for NaNs (GH-13254)

ValueError: Input contains infinity or a value too large for dtype('float64').

I tried removing all the constant features from the original dataset, so that all the original features have std() > 0.
It looks like a generated feature has a division by zero somewhere that leads to an infinite value deep inside the generated features.
Maybe there should be some handling there, ignoring the feature or replacing the infinities with NaN, which the scalers know to ignore?
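A sketch of the suggested mitigation as a cleanup step before scaling (a generic pandas idiom, not the library's actual handling; note the traceback shows check_array with force_all_finite='allow-nan', which tolerates NaN but not inf):

import numpy as np

# map +/-inf produced by divisions by zero to NaN, which the
# 'allow-nan' validation in the scaler would accept
df = df.replace([np.inf, -np.inf], np.nan)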

What metric is used for classification?

Can you please clarify what metric is used for classification?
Is it possible to use the F1 metric for optimization, since data in real life is unbalanced?
