cod3licious / autofeat

Linear Prediction Model with Automated Feature Engineering and Selection Capabilities

License: MIT License

Python 69.27% Jupyter Notebook 30.73%
machine-learning machine-learning-models linear-regression feature-engineering feature-selection automl automated-machine-learning automated-feature-engineering automated-data-science

autofeat's Introduction

autofeat library

Linear Prediction Models with Automated Feature Engineering and Selection

This library contains the AutoFeatRegressor and AutoFeatClassifier models, which have an interface similar to that of scikit-learn models:

  • fit() function to fit the model parameters
  • predict() function to predict the target variable given the input
  • predict_proba() function to predict probabilities of the target variable given the input (classifier only)
  • score() function to calculate the goodness of the fit (R^2/accuracy)
  • fit_transform() and transform() functions, which extend the given data by the additional features that were engineered and selected by the model

When calling the fit() function, the fit_transform() function is called internally, so if you're planning to call transform() on the same data anyway, just call fit_transform() right away. transform() is mostly useful if you've split your data into training and test data and did not call fit_transform() on your whole dataset. The predict() and score() functions can either be given data in the format of the original dataframe that was used when calling fit()/fit_transform(), or they can be given an already transformed dataframe.
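For illustration, a minimal usage sketch (the toy data and column names are made up; the calls are the interface described above):

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from autofeat import AutoFeatRegressor

# toy data with hypothetical column names
df = pd.DataFrame({"x1": np.random.rand(100), "x2": np.random.rand(100)})
y = df["x1"] ** 2 + df["x2"]
X_train, X_test, y_train, y_test = train_test_split(df, y, random_state=42)

model = AutoFeatRegressor(feateng_steps=2)
# fit() would call fit_transform() internally, so call fit_transform() directly
X_train_tr = model.fit_transform(X_train, y_train)
# generate only the selected features for the held-out data
X_test_tr = model.transform(X_test)
# score() also accepts data in the original dataframe format
print(model.score(X_test, y_test))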

In addition, the feature selection part on its own is available in the FeatureSelector model.

Furthermore (as of version 2.0.0), minimal feature selection (removing zero-variance and redundant features), feature engineering (simple products and ratios of features), and scaling (a power transform to make features more normally distributed) are also available in the AutoFeatLight model.
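A short sketch of these two entry points (FeatureSelector's arguments follow an issue report further below; AutoFeatLight is shown with defaults since its constructor arguments aren't listed here):

from autofeat import FeatureSelector, AutoFeatLight

# standalone feature selection on an existing feature matrix
fsel = FeatureSelector(problem_type="regression")
X_selected = fsel.fit_transform(X, y)

# lightweight cleanup: zero-variance/redundant removal, products/ratios, scaling
afl = AutoFeatLight()
X_light = afl.fit_transform(X)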

The AutoFeatRegressor, AutoFeatClassifier, and FeatureSelector models need to be fit on data without NaNs, as they internally call the sklearn LassoLarsCV model, which cannot handle NaNs. When calling transform(), NaNs (but not np.inf) are okay.

The autofeat examples notebook contains a simple usage example - try it out! :) Additional examples can be found in the autofeat benchmark notebooks for regression (which also contains the code to reproduce the results from the paper mentioned below) and classification, as well as the testing scripts.

Please keep in mind that since the AutoFeatRegressor and AutoFeatClassifier models can generate very complex features, they might overfit on noise in the dataset, i.e., find some new features that lead to good prediction on the training set but result in poor performance on new test samples. While this usually only happens for datasets with very few samples, we suggest you carefully inspect the features found by autofeat and use those that make sense to you to train your own models.

Depending on the number of feateng_steps (default 2) and the number of input features, autofeat can generate a huge feature matrix (before selecting the most appropriate features from this large feature pool). By specifying in feateng_cols those columns that you expect to be most valuable for feature engineering, the number of generated features can be greatly reduced. Additionally, the transformations can be limited to only those that make sense for your data. Last but not least, you can subsample the data used for training the model to limit the memory requirements. After the model has been fit, you can call transform() on your whole dataset to generate only those few features that were selected during fit()/fit_transform().
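As a sketch of these options (the column names are placeholders, and the transformation names shown are assumptions about the accepted values):

model = AutoFeatRegressor(
    feateng_steps=2,                  # depth of the feature engineering process
    feateng_cols=["x1", "x2"],        # only engineer features from these columns
    transformations=("log", "sqrt"),  # restrict the applied transformations
)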

Installation

You can either download the code from here and include the autofeat folder in your $PYTHONPATH or install (the library components only) via pip:

$ pip install autofeat

The library requires Python 3! Other dependencies: numpy, pandas, scikit-learn, sympy, joblib, pint and numba.

Paper

For further details on the model and implementation please refer to the paper - and of course if any of this code was helpful for your research, please consider citing it:

@inproceedings{horn2019autofeat,
  title={The autofeat Python Library for Automated Feature Engineering and Selection},
  author={Horn, Franziska and Pack, Robert and Rieger, Michael},
  booktitle={Joint European Conference on Machine Learning and Knowledge Discovery in Databases},
  pages={111--120},
  year={2019},
  organization={Springer}
}

If you don't like reading, you can also watch a video of my talk at the PyData conference about automated feature engineering and selection with autofeat.

The code is intended for research purposes.

If you have any questions, please don't hesitate to send me an email. And of course, if you find any bugs or want to contribute other improvements, pull requests are very welcome!

Acknowledgments

This project was made possible thanks to support by BASF.

autofeat's People

Contributors

cod3licious, jeethu, mglowacki100, stephanos-stephani


autofeat's Issues

In toydata, autofeat finds the correct function (square) only under some circumstances

Hi,
Not sure if this is a bug, but I thought the behavior was quite strange. I created some toy data: a set of independent variables x_i in [0, 1), and as target I compute y = x_0^2. The other x_i are noise variables.
Autofeat doesn't find the correct functional form in this case; instead it finds something like a*x_0 + b*(1/x_0) for suitable values of a and b.
However, when I increase the range of x_0 from [0, 1) to [-1, 1), autofeat easily finds the correct functional form.

Here's my code for data generation:

import numpy as np

def generate_toydata(n_samples, fun, x_min, x_max, noise_level=0.1):
    x_range = x_max - x_min
    # main feature
    x_main = (np.random.rand(n_samples) * x_range) + x_min
    # generate 4 unrelated noise features
    x_unrelated = np.random.rand(n_samples, 4)
    y = fun(x_main) + np.random.normal(0, noise_level, n_samples)
    X = np.column_stack((x_main, x_unrelated))
    return X, y


from sklearn.model_selection import train_test_split

X, y = generate_toydata(
    n_samples=100,
    fun=np.square,
    x_min=0.0,
    x_max=1.0,
    noise_level=0.01,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False,
)
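To complete the reproduction, fitting would look something like this (the feateng_steps setting is an assumption; the issue doesn't state the parameters used):

from autofeat import AutoFeatRegressor

model = AutoFeatRegressor(feateng_steps=2)
# fit on the training split and inspect which features were engineered
X_train_tr = model.fit_transform(X_train, y_train)
print(X_train_tr.columns)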

pandas corr is too slow; use numpy instead

When running on a medium dataset (~50K instances, ~35 features), I noticed that autofeat's AutoFeatLight was taking hours and hours to complete. After debugging a bit, the bottleneck was the line:

corrmat = df.corr().abs()

When I replaced it with numpy, the code ran orders of magnitude faster:

corrmat = pd.DataFrame(np.abs(np.corrcoef(df.values, rowvar=False)), columns=df.columns)

error: Unable to find vcvarsall.bat

I got this error on windows 10:

df = model.fit_transform(X, y)
[AutoFeatRegression] The 3 step feature engineering process could generate up to 2864745 features.
[AutoFeatRegression] With 48573 data points this new feature matrix would use about 556.60 gb of space.
Step 1: transformation of original features
              0/             83
Traceback (most recent call last):
  File "D:\Anaconda3\envs\quant\lib\site-packages\sympy\utilities\autowrap.py", line 168, in _process_files
    retoutput = check_output(command, stderr=STDOUT)
  File "D:\Anaconda3\envs\quant\lib\subprocess.py", line 356, in check_output
    **kwargs).stdout
  File "D:\Anaconda3\envs\quant\lib\subprocess.py", line 438, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['D:\\Anaconda3\\envs\\quant\\python.exe', 'setup.py', 'build_ext', '--inplace']' returned non-zero exit status 1.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "D:\Anaconda3\envs\quant\lib\site-packages\IPython\core\interactiveshell.py", line 3291, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-19-818b183e7273>", line 1, in <module>
    df = model.fit_transform(X, y)
  File "D:\Anaconda3\envs\quant\lib\site-packages\autofeat\autofeat.py", line 249, in fit_transform
    self.feateng_steps, self.transformations)
  File "D:\Anaconda3\envs\quant\lib\site-packages\autofeat\feateng.py", line 249, in generate_features
    original_features.extend(apply_tranformations(original_features))
  File "D:\Anaconda3\envs\quant\lib\site-packages\autofeat\feateng.py", line 194, in apply_tranformations
    f = ufuncify(t, expr_temp)
  File "D:\Anaconda3\envs\quant\lib\site-packages\sympy\core\cache.py", line 94, in wrapper
    retval = cfunc(*args, **kwargs)
  File "D:\Anaconda3\envs\quant\lib\site-packages\sympy\utilities\autowrap.py", line 1105, in ufuncify
    return code_wrapper.wrap_code(routines, helpers=helps)
  File "D:\Anaconda3\envs\quant\lib\site-packages\sympy\utilities\autowrap.py", line 828, in wrap_code
    self._process_files(routines)
  File "D:\Anaconda3\envs\quant\lib\site-packages\sympy\utilities\autowrap.py", line 172, in _process_files
    " ".join(command), e.output.decode('utf-8')))
sympy.utilities.autowrap.CodeWrapError: Error while executing command: D:\Anaconda3\envs\quant\python.exe setup.py build_ext --inplace. Command output is:
running build_ext
running build_src
build_src
building extension "wrapper_module_1" sources
build_src: building npy-pkg config files
No module named 'numpy.distutils._msvccompiler' in numpy.distutils; trying from distutils
customize MSVCCompiler
customize MSVCCompiler using build_ext
building 'wrapper_module_1' extension
compiling C sources
error: Unable to find vcvarsall.bat

[enhancement] add predict_proba for classifiers

Autofeat provides only predict for classification problems; it should also provide predict_proba for use cases where we're interested not in a 'hard prediction' but in a 'ranking'. I think it shouldn't be hard to expose it from the underlying sklearn models.

least_angle.py in lars_path: ValueError: operands could not be broadcast together with shapes

Known issue related to sklearn bug: scikit-learn/scikit-learn#9603
Sometimes (randomly) during the feature selection process, the LassoLARS model crashes with an error like this:

.../lib/python3.6/site-packages/sklearn/linear_model/least_angle.py in lars_path(X, y, Xy, Gram, max_iter, alpha_min, method, copy_X, eps, copy_Gram, verbose, return_path, return_n_iter, positive)
    378                                  least_squares)
    379 
--> 380         g1 = arrayfuncs.min_pos((C - Cov) / (AA - corr_eq_dir + tiny32))
    381         if positive:
    382             gamma_ = min(g1, C / AA)

ValueError: operands could not be broadcast together with shapes (397,) (396,)

Try updating your sklearn and numpy versions; otherwise I'm afraid there is not much that could be done here.

Documentation Enhancement for getting model, features, coefficients


name: Documentation Enhancement
about: Request for clarification or improvement in the documentation


Description of the Issue

I've been using the autofeat library and encountered difficulties in obtaining the final model, model features, and coefficients as described in the library's documentation.

  • The model.get_model(), model.get_model_features(), and model.get_model_coefs() methods do not seem to work as expected.
  • I would appreciate clarification on how to access the final model, its features, and coefficients or an update to the documentation.

Additional Information

  • Python Version: [e.g., Python 3.9]
  • autofeat Library Version: [e.g., autofeat 2.1.2]

Note to Maintainers

Please consider updating the documentation or providing guidance on how to access the final model, features, and coefficients using the AutoFeatRegressor class. This would greatly help users in understanding and utilizing the library effectively.

Thank you for your attention to this matter.

ufunc '_lambdifygenerated' did not contain a loop with signature matching types (<class 'numpy.dtype[float32]'>, <class 'numpy.dtype[float32]'>) -> None

Hi, Drs,
X_train and y_train are DataFrames of float64 arrays when I use the following code:

    import scipy.io as sio
    import pandas as pd
    from autofeat import AutoFeatClassifier

    alldata1 = sio.loadmat(path)['input_train'].ravel()
    x_train = pd.DataFrame(alldata1[folds])
    x_train.columns = input_new_columns
    alldata2 = sio.loadmat(path)['output_train'].ravel()
    y_train = pd.DataFrame(alldata2[folds])
    y_train.columns = ['targat']
    alldata3 = sio.loadmat(path)['input_test'].ravel()
    x_test = pd.DataFrame(alldata3[folds])
    x_test.columns = input_new_columns
    alldata4 = sio.loadmat(path)['output_test'].ravel()
    y_test = pd.DataFrame(alldata4[folds])
    y_test.columns = ['targat']
    alldata5 = sio.loadmat(path)['input_vaildation'].ravel()
    x_validation = pd.DataFrame(alldata5[folds])
    x_validation.columns = input_new_columns
    alldata6 = sio.loadmat(path)['output_vaildation'].ravel()
    y_validation = pd.DataFrame(alldata6[folds])
    y_validation.columns = ['targat']
    afreg = AutoFeatClassifier(feateng_steps=2, categorical_cols=sparse_new_columns)
    x_train_tr = afreg.fit_transform(x_train, y_train)

but I get an error:

UFuncTypeError: ufunc '_lambdifygenerated' did not contain a loop with signature matching types (<class 'numpy.dtype[float32]'>, <class 'numpy.dtype[float32]'>) -> None

May I ask what causes this?

ValueError: Input X contains NaN.

afreg = AutoFeatRegressor(verbose=1, feateng_steps=3, n_jobs=24)
afreg.fit(x_train, y_train)

ValueError Traceback (most recent call last)
Input In [15], in <cell line: 2>()
1 afreg = AutoFeatRegressor(verbose=1, feateng_steps=1, n_jobs=24)
----> 2 afreg.fit(x_train, y_train)

File /app/conda/b2b-latest/lib/python3.8/site-packages/autofeat/autofeat.py:422, in AutoFeatModel.fit(self, X, y)
420 print("[AutoFeat] Warning: This just calls fit_transform() but does not return the transformed dataframe.")
421 print("[AutoFeat] It is much more efficient to call fit_transform() instead of fit() and transform()!")
--> 422 _ = self.fit_transform(X, y) # noqa
423 return self

File /app/conda/b2b-latest/lib/python3.8/site-packages/autofeat/autofeat.py:353, in AutoFeatModel.fit_transform(self, X, y)
351 else:
352 if self.problem_type in ("regression", "classification"):
--> 353 good_cols = select_features(
354 df_subs, target_sub, self.featsel_runs, None, self.problem_type, self.n_jobs, self.verbose
355 )
356 # if no features were selected, take the original features
357 if not good_cols:

File /app/conda/b2b-latest/lib/python3.8/site-packages/autofeat/featsel.py:245, in select_features(df, target, featsel_runs, keep, problem_type, n_jobs, verbose)
241 def flatten_lists(l: list):
242 return [item for sublist in l for item in sublist]
244 selected_columns = flatten_lists(
--> 245 Parallel(n_jobs=n_jobs, verbose=100 * verbose)(delayed(run_select_features)(i) for i in range(featsel_runs))
246 )
248 if selected_columns:
249 selected_columns = Counter(selected_columns)

File /app/conda/b2b-latest/lib/python3.8/site-packages/joblib/parallel.py:1944, in Parallel.__call__(self, iterable)
1938 # The first item from the output is blank, but it makes the interpreter
1939 # progress until it enters the Try/Except block of the generator and
1940 # reach the first yield statement. This starts the aynchronous
1941 # dispatch of the tasks to the workers.
1942 next(output)
-> 1944 return output if self.return_generator else list(output)

File /app/conda/b2b-latest/lib/python3.8/site-packages/joblib/parallel.py:1587, in Parallel._get_outputs(self, iterator, pre_dispatch)
1584 yield
1586 with self._backend.retrieval_context():
-> 1587 yield from self._retrieve()
1589 except GeneratorExit:
1590 # The generator has been garbage collected before being fully
1591 # consumed. This aborts the remaining tasks if possible and warn
1592 # the user if necessary.
1593 self._exception = True

File /app/conda/b2b-latest/lib/python3.8/site-packages/joblib/parallel.py:1691, in Parallel._retrieve(self)
1684 while self._wait_retrieval():
1685
1686 # If the callback thread of a worker has signaled that its task
1687 # triggered an exception, or if the retrieval loop has raised an
1688 # exception (e.g. GeneratorExit), exit the loop and surface the
1689 # worker traceback.
1690 if self._aborting:
-> 1691 self._raise_error_fast()
1692 break
1694 # If the next job is not ready for retrieval yet, we just wait for
1695 # async callbacks to progress.

File /app/conda/b2b-latest/lib/python3.8/site-packages/joblib/parallel.py:1726, in Parallel._raise_error_fast(self)
1722 # If this error job exists, immediatly raise the error by
1723 # calling get_result. This job might not exists if abort has been
1724 # called directly or if the generator is gc'ed.
1725 if error_job is not None:
-> 1726 error_job.get_result(self.timeout)

File /app/conda/b2b-latest/lib/python3.8/site-packages/joblib/parallel.py:735, in BatchCompletionCallBack.get_result(self, timeout)
729 backend = self.parallel._backend
731 if backend.supports_retrieve_callback:
732 # We assume that the result has already been retrieved by the
733 # callback thread, and is stored internally. It's just waiting to
734 # be returned.
--> 735 return self._return_or_raise()
737 # For other backends, the main thread needs to run the retrieval step.
738 try:

File /app/conda/b2b-latest/lib/python3.8/site-packages/joblib/parallel.py:753, in BatchCompletionCallBack._return_or_raise(self)
751 try:
752 if self.status == TASK_ERROR:
--> 753 raise self._result
754 return self._result
755 finally:

ValueError: Input X contains NaN.
LassoLarsCV does not accept missing values encoded as NaN natively.

Can you share a simple example to see this mentioned overfit?

Quoting the README: "Please keep in mind that since the AutoFeatRegressor and AutoFeatClassifier models can generate very complex features, it will likely overfit on noise in the dataset, though the coefficients for features related to noise should be fairly small. It is generally suggested to carefully inspect the features found by autofeat and use those that make sense to you to train your own models."

TypeError: unsupported operand type(s) for |: 'type' and 'NoneType'

Getting this error while using the benchmark of autofeat.AutoFeatClassifier

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[1], line 8
      6 import numpy as np
      7 import pandas as pd
----> 8 from autofeat import AutoFeatClassifier
      9 from sklearn.datasets import load_breast_cancer, load_iris, load_wine
     10 from sklearn.ensemble import RandomForestClassifier

File ~/my-conda-envs/fraud/lib/python3.8/site-packages/autofeat/__init__.py:6
      4 name = "autofeat"
      5 __version__ = "2.1.0"
----> 6 from .autofeatlight import AutoFeatLight  # noqa
      7 from .autofeat import AutoFeatModel, AutoFeatRegressor, AutoFeatClassifier  # noqa
      8 from .featsel import FeatureSelector  # noqa

File ~/my-conda-envs/fraud/lib/python3.8/site-packages/autofeat/autofeatlight.py:56
     51         print(sorted(useless_cols))
     52     return [c for c in df.columns if c not in useless_cols]
     55 def _compute_additional_features(
---> 56     X: np.ndarray, feature_names: list | None = None, compute_ratio: bool = True, compute_product: bool = True, verbose: int = 0
     57 ) -> Tuple[np.ndarray, list]:
     58     """
     59     Compute additional non-linear features from the original features (ratio or product of two features).
     60 
   (...)
     69         - list with n_additional_features names describing the newly computed features
     70     """
     71     # check how many new features we will compute

TypeError: unsupported operand type(s) for |: 'type' and 'NoneType'
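For context, a likely cause (an observation, not an official fix): the list | None annotation shown in the traceback is PEP 604 union syntax, which is evaluated at function-definition time here and requires Python 3.10+, while the traceback shows a Python 3.8 environment. A sketch of the two standard workarounds:

# option 1: defer annotation evaluation, so the | union is never executed
# (must be the first import in the module)
from __future__ import annotations

# option 2: spell the union with typing.Optional, which works on older Pythons
from typing import Optional

def _compute_additional_features(X, feature_names: Optional[list] = None):
    ...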

Cython implementation

I was thinking of using cython to run autofeat in C and changing some of the variables to static. Do you think this would make autofeat significantly faster?

Correlation matrix can have inconsistent column and row names

#28 introduced a nice speed improvement by using numpy instead of pandas for correlation.

There seems to be a small bug on this line though:
https://github.com/cod3licious/autofeat/blob/master/autofeat/autofeatlight.py#L35

By not specifying an index, the correlation matrix might have feature names as column headers but numeric row indices. This in turn throws off the elimination of correlated columns.

Adding index=df.columns recovers the old behavior:

import numpy as np
import pandas as pd
from pandas.testing import assert_frame_equal

corrmat_old = df.corr().abs()
corrmat = pd.DataFrame(
    np.abs(np.corrcoef(df.values, rowvar=False)), columns=df.columns, index=df.columns
)

assert_frame_equal(corrmat_old, corrmat)

Scaling and Autofeat

Thank you for the awesome library.

Just have a question about feature scaling, I've read that features need to be scaled BEFORE using autofeat (https://analyticsindiamag.com/guide-to-automatic-feature-engineering-using-autofeat/).

Is this true? I can't seem to find information in the repository that supports this statement.

When running autofeat, it seems to perform scaling during feature selection, so I am just wondering if I have to do scaling beforehand or if I can just throw my data into autofeat and it is taken care of.

Parallelization

Great package!

Have you looked into using joblib on this loop,
for i, (feat1, feat2) in enumerate(feature_tuples)
in feateng.py?

Seems like it could be easily parallelized. You have several global variables inside the loop, so I haven't had a chance to deconstruct this into a function that could be called via joblib. I thought it might be more straightforward for you as you are more familiar with the code.
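A generic sketch of the suggestion (process_pair is a hypothetical stand-in for the loop body; as noted above, the shared state inside the real loop would first need to be untangled):

from joblib import Parallel, delayed

def process_pair(feat1, feat2):
    # hypothetical: compute the new combination feature for one tuple
    ...

results = Parallel(n_jobs=-1)(
    delayed(process_pair)(feat1, feat2) for feat1, feat2 in feature_tuples
)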

Input contains NaN, infinity or a value too large for dtype('float32') on fit_transform

Facing the following issue when running AutoFeatRegressor.fit_transform(featuresDf, targetFeature):

I already checked that there are no infinity or NaN values, and converted everything to float32. Any pointers? Thanks!

Update: tried the same set of inputs with FeatureSelector and everything is working great.

Update 2: posted this question on StackOverflow

Exception

c:\dox\rnd\ml-pipeline-notebooks\modules\autoFeat.py in augmentFeatures(features, targetFeature, verbose)
     41     print(featuresDf.head())
     42     print(targetFeature)
---> 43     newFeatureDf = autoFeatRegressor.fit_transform(featuresDf, targetFeature)
     44 
     45     if verbose:
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\utils\validation.py in _assert_all_finite(X, allow_nan, msg_dtype)
    112         ):
    113             type_err = "infinity" if allow_nan else "NaN, infinity"
--> 114             raise ValueError(
    115                 msg_err.format(
    116                     type_err, msg_dtype if msg_dtype is not None else X.dtype

ValueError: Input contains NaN, infinity or a value too large for dtype('float32').

Logs (verbose): (screenshot omitted)

Reproducibility issue

Hello,

I noticed that results are not reproducible when using the library, i.e., the sklearn drop-in-replacement classes produce slightly different results each time.

For example, when using:

features_engineer = AutoFeatClassifier()
features_engineer.fit_transform(data_train.data, data_train.target.value)

it will calculate (or select) different features each time.

I temporarily fixed the issue above by using:

 random.seed(seed)
 np.random.seed(seed)

so that the outputs produced by AutoFeatClassifier stay constant across runs.

However, when I tried using the following:

selector = FeatureSelector(verbose=self.verbose, problem_type="classification", featsel_runs=5)
selector.fit_transform(df_indices, target)

the above-mentioned seed-setting trick didn't have the desired effect: the selected features still change between runs...

Is there an easy fix for this? Randomness must be introduced somewhere in the source.

AttributeError: module 'sympy.core.core' has no attribute 'add'

After updating your library, my public notebook Titanic - Autofeat (automatic FE), which selects features for the Kaggle competition "Titanic - Machine Learning from Disaster", stopped working.
Earlier (a year ago) this code worked successfully (autofeat-0.2.5):

model = AutoFeatRegression()
X_train_feature_creation = model.fit_transform(train.to_numpy(), target.to_numpy().flatten())
X_test_feature_creation = model.transform(test.to_numpy())

Now it outputs an error (autofeat-2.0.4):

model = AutoFeatRegressor()
X_train_feature_creation = model.fit_transform(train.to_numpy(), target.to_numpy().flatten())
X_test_feature_creation = model.transform(test.to_numpy())

I've tried this as well:

X_train_feature_creation = model.fit_transform(train, target)
X_test_feature_creation = model.transform(test)

and:

model.fit(train, target)

I got the same error.

Thanks in advance for the advice on how to fix this.

Cannot reproduce results

Every time I call fit_transform I get different results.

I noticed that np.random.permutation changes the random_state, so I used np.random.RandomState(seed=seed).permutation() to solve this.

I also noticed that np.random.seed(i) is used in run_select_features, but it changes the random state in the same way, so I can always convert back to the random_state that I had.

Even with those changes, and always getting the same random_state after calling fit_transform, I always end up with different results.

Model details

Hi, first of all, great work!!! You have just made something far better than featuretools, manual feature engineering, etc., so congratulations!

I have two questions. First, are the classifier and regressor ensemble methods, and can I see the maths behind them? Secondly, after training it shows a final score; is that a training score or cross-validation?

Theoretical question

I have a question:

Could the features created by autofeat be as effective when used in another ML algorithm or in a classification problem?

Multiclass Classification Problem

I know this is only for regression, but the approach I am thinking of is: I fit_transform the entire data frame containing multiple labels in the 'y' column and feed the obtained data frame to some other model. Will that work, or is there a better approach? (Provide a code example if possible.)

Thanks.
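Since a code example was requested, a sketch of the proposed workflow (note that AutoFeatClassifier exists in current versions, per the README above; whether its internal feature selection copes well with multiclass targets is not confirmed here, and RandomForestClassifier is just an example downstream model):

from autofeat import AutoFeatClassifier
from sklearn.ensemble import RandomForestClassifier

afc = AutoFeatClassifier(feateng_steps=2)
# engineer and select features once; y holds the multiclass labels
X_tr = afc.fit_transform(X, y)
# feed the extended dataframe to any downstream model
clf = RandomForestClassifier().fit(X_tr, y)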

Speed up transform()

Thank you for the great product!

My question is: how can I speed up the .transform() function? I use an input dataframe with only ONE row, but the time needed to transform the original features into the new ones is too long, and I don't understand why. Shouldn't it just apply the stored formulas to the new dataset with the original features?

Maybe you know how to speed up this process?

Issues with sympy module

I ran autofeat on a set of fully numeric features with no nulls

using

afreg = AutoFeatClassifier(verbose=1, feateng_steps=2, max_gb=10, n_jobs=12)
ft_train_tr = afreg.fit_transform(ft_train, target)

the verbose logging showed:

[AutoFeat] The 2 step feature engineering process could generate up to 980700 features.
[AutoFeat] With 200000 data points this new feature matrix would use about 784.56 gb of space.
[AutoFeat] As you specified a limit of 10 gb, the number of data points is subsampled to 2549
[feateng] Step 1: transformation of original features
[feateng] Generated 915 transformed features from 200 original features - done.
[feateng] Step 2: first combination of features
[feateng] Generated 620258 feature combinations from 621055 original feature tuples - done.
[feateng] Generated altogether 621174 new features in 2 steps
[feateng] Removing correlated features, as well as additions at the highest level

then I got this error:

AttributeError Traceback (most recent call last)
<ipython-input-10-5be3e98d3950> in <module>
      1 afreg = AutoFeatClassifier(verbose=1, feateng_steps=2, max_gb=10, n_jobs=12)
----> 2 ft_train_tr = afreg.fit_transform(ft_train, target)

/opt/conda/lib/python3.8/site-packages/autofeat/autofeat.py in fit_transform(self, X, y)
    298             target_sub = target.copy()
    299         # generate features
--> 300         df_subs, self.feature_formulas_ = engineer_features(df_subs, self.feateng_cols_, _parse_units(self.units, verbose=self.verbose),
    301                                                             self.feateng_steps, self.transformations, self.verbose)
    302         # select predictive features

/opt/conda/lib/python3.8/site-packages/autofeat/feateng.py in engineer_features(df_org, start_features, units, max_steps, transformations, verbose)
    343         print("[feateng] Generated altogether %i new features in %i steps" % (len(feature_pool) - len(start_features), max_steps))
    344         print("[feateng] Removing correlated features, as well as additions at the highest level")
--> 345     feature_pool = {c: feature_pool[c] for c in feature_pool if c in uncorr_features and not feature_pool[c].func == sympy.add.Add}
    346     cols = [c for c in list(df.columns) if c in feature_pool and c not in df_org.columns]  # categorical cols not in feature_pool
    347     if cols:

/opt/conda/lib/python3.8/site-packages/autofeat/feateng.py in <dictcomp>(.0)
    343         print("[feateng] Generated altogether %i new features in %i steps" % (len(feature_pool) - len(start_features), max_steps))
    344         print("[feateng] Removing correlated features, as well as additions at the highest level")
--> 345     feature_pool = {c: feature_pool[c] for c in feature_pool if c in uncorr_features and not feature_pool[c].func == sympy.add.Add}
    346     cols = [c for c in list(df.columns) if c in feature_pool and c not in df_org.columns]  # categorical cols not in feature_pool
    347     if cols:

AttributeError: module 'sympy' has no attribute 'add'

The ft_train columns are all of type float16; the target is a pandas series with values array([0, 1]).

Any ideas what is wrong?

How to transform new data?

Does this library have a method to transform a new dataset into one with the additional features generated before?

So I need to perform these steps:

  1. Generate additional features for my dataset.
  2. Save transformation matrix.
  3. Apply transformation matrix on new dataset to predict it.

Is this possible now?
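Per the README above, this is exactly what transform() is for; a minimal sketch (df_train, y_train, and df_new are placeholders):

from autofeat import AutoFeatRegressor

model = AutoFeatRegressor()
# steps 1+2: generate features and store the selected formulas in the model
df_train_tr = model.fit_transform(df_train, y_train)
# step 3: apply the stored formulas to a new dataset in the original format
df_new_tr = model.transform(df_new)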

Data validation error when using Buckingham's Pi Theorem on Classification task

Hi!
While trying to use the AutoFeatClassifier with units, I stumbled upon a validation error caused by an infinite value.
Presumably one of the generated features (I assume one of those coming from the Pi theorem) has an infinite value, which breaks the StandardScaler used during the filtering of correlated features.
This is how I am calling the classifier, fitting training data that comes as a numpy ndarray:

auto = AutoFeatClassifier(categorical_cols=categorical_cols, units=units, verbose=1, feateng_steps=3, featsel_runs=5, n_jobs=5, apply_pi_theorem=True)
X_train_new = auto.fit_transform(X_train_sampled, y_train_sampled)

These are the features logged for the Pi Theorem; all of them include divisions (which could lead to a division-by-zero issue).

...
[AutoFeat] Applying the Pi Theorem
[AutoFeat] Pi Theorem 1:  x002 / x001
[AutoFeat] Pi Theorem 2:  x006 / x000
[AutoFeat] Pi Theorem 3:  x010 / x005
[AutoFeat] Pi Theorem 4:  x003 / x001
[AutoFeat] Pi Theorem 5:  x013 / x001
[AutoFeat] Pi Theorem 6:  x014 / x005
[AutoFeat] Pi Theorem 7:  x000 * x005 * x012 / x015
[AutoFeat] Pi Theorem 8:  x016 / x000
[AutoFeat] Pi Theorem 9:  x017 / x012
...

The full log output by a failing run is the following:

[AutoFeat] Applying the Pi Theorem
[AutoFeat] Pi Theorem 1:  x002 / x001
[AutoFeat] Pi Theorem 2:  x006 / x000
[AutoFeat] Pi Theorem 3:  x007 / x005
[AutoFeat] Pi Theorem 4:  x003 / x001
[AutoFeat] Pi Theorem 5:  x009 / x001
[AutoFeat] Pi Theorem 6:  x010 / x005
[AutoFeat] Pi Theorem 7:  x000 * x005 * x008 / x011
[AutoFeat] Pi Theorem 8:  x012 / x000
[AutoFeat] Pi Theorem 9:  x013 / x008
[AutoFeat] The 3 step feature engineering process could generate up to 118923 features.
[AutoFeat] With 121 data points this new feature matrix would use about 0.06 gb of space.
[feateng] Step 1: transformation of original features
[feateng] Generated 40 transformed features from 14 original features - done.
[feateng] Step 2: first combination of features
[feateng] Generated 1524 feature combinations from 1431 original feature tuples - done.
[feateng] Step 3: transformation of new features
[feateng] Generated 4564 transformed features from 1524 original features - done.
[feateng] Generated altogether 6233 new features in 3 steps
[feateng] Removing correlated features, as well as additions at the highest level

And after that, the error is reported with the following stack trace:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-323-53dcdfc1b68e> in <module>
     32     # categorical_cols = []
     33     auto = AutoFeatClassifier(categorical_cols=categorical_cols, units=units, verbose=1, feateng_steps=3, featsel_runs=5, n_jobs=5, apply_pi_theorem=True)
---> 34     X_train_new = auto.fit_transform(X_train_sampled, y_train_sampled)
     35     X_test_new = auto.transform(X_test.to_numpy())
     36     pretty_names = feature_names(auto, USEFUL_ACTUALS)

~/.pyenv/versions/features/lib/python3.7/site-packages/autofeat/autofeat.py in fit_transform(self, X, y)
    299         # generate features
    300         df_subs, self.feature_formulas_ = engineer_features(df_subs, self.feateng_cols_, _parse_units(self.units, verbose=self.verbose),
--> 301                                                             self.feateng_steps, self.transformations, self.verbose)
    302         # select predictive features
    303         if self.featsel_runs <= 0:

~/.pyenv/versions/features/lib/python3.7/site-packages/autofeat/feateng.py in engineer_features(df_org, start_features, units, max_steps, transformations, verbose)
    354     if cols:
    355         # check for correlated features again; this time with the start features
--> 356         corrs = dict(zip(cols, np.max(np.abs(np.dot(StandardScaler().fit_transform(df[cols]).T, StandardScaler().fit_transform(df_org))/df_org.shape[0]), axis=1)))
    357         cols = [c for c in cols if corrs[c] < 0.9]
    358     cols = list(df_org.columns) + cols

~/.pyenv/versions/features/lib/python3.7/site-packages/sklearn/base.py in fit_transform(self, X, y, **fit_params)
    688         if y is None:
    689             # fit method of arity 1 (unsupervised transformation)
--> 690             return self.fit(X, **fit_params).transform(X)
    691         else:
    692             # fit method of arity 2 (supervised transformation)

~/.pyenv/versions/features/lib/python3.7/site-packages/sklearn/preprocessing/_data.py in fit(self, X, y)
    665         # Reset internal state before fitting
    666         self._reset()
--> 667         return self.partial_fit(X, y)
    668 
    669     def partial_fit(self, X, y=None):

~/.pyenv/versions/features/lib/python3.7/site-packages/sklearn/preprocessing/_data.py in partial_fit(self, X, y)
    696         X = self._validate_data(X, accept_sparse=('csr', 'csc'),
    697                                 estimator=self, dtype=FLOAT_DTYPES,
--> 698                                 force_all_finite='allow-nan')
    699 
    700         # Even in the case of `with_mean=False`, we update the mean anyway

~/.pyenv/versions/features/lib/python3.7/site-packages/sklearn/base.py in _validate_data(self, X, y, reset, validate_separately, **check_params)
    418                     f"requires y to be passed, but the target y is None."
    419                 )
--> 420             X = check_array(X, **check_params)
    421             out = X
    422         else:

~/.pyenv/versions/features/lib/python3.7/site-packages/sklearn/utils/validation.py in inner_f(*args, **kwargs)
     70                           FutureWarning)
     71         kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 72         return f(**kwargs)
     73     return inner_f
     74 

~/.pyenv/versions/features/lib/python3.7/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
    643         if force_all_finite:
    644             _assert_all_finite(array,
--> 645                                allow_nan=force_all_finite == 'allow-nan')
    646 
    647     if ensure_min_samples > 0:

~/.pyenv/versions/features/lib/python3.7/site-packages/sklearn/utils/validation.py in _assert_all_finite(X, allow_nan, msg_dtype)
     97                     msg_err.format
     98                     (type_err,
---> 99                      msg_dtype if msg_dtype is not None else X.dtype)
    100             )
    101     # for object dtype data, we only check for NaNs (GH-13254)

ValueError: Input contains infinity or a value too large for dtype('float64').

I tried removing all the constant features from the original dataset, so that all the original features have std() > 0.
It looks like a generated feature has a division by zero somewhere that leads to an infinite value deep inside the generated features.
Maybe there should be some handling there, ignoring the feature or replacing the infinities with NaN, which the scalers know to ignore?
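A sketch of the suggested mitigation as a cleanup step before scaling (a generic pandas idiom, not the library's actual handling; note the traceback shows check_array with force_all_finite='allow-nan', which tolerates NaN but not inf):

import numpy as np

# map +/-inf produced by divisions by zero to NaN, which the
# 'allow-nan' validation in the scaler would accept
df = df.replace([np.inf, -np.inf], np.nan)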

What metric is used for classification?

Can you please clarify what metric is used for classification?
Is it possible to use the F1 metric for optimization, since data in real life is unbalanced?
