cod3licious / autofeat
Linear Prediction Model with Automated Feature Engineering and Selection Capabilities
License: MIT License
Every time I call fit_transform I get different results.
I noticed that np.random.permutation changes the random_state, so I used np.random.RandomState(seed=seed).permutation() to solve.
I also noticed that np.random.seed(i) is used in run_select_features, but it changes the random state in the same way, so I can always convert back to the random_state that I had.
Even with those changes, and always getting the same random_state after calling fit_transform, I always end up with different results.
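The workaround described above (a private generator instead of the global one) can be sketched as follows. This is a minimal illustration, not autofeat's code: `local_permutation` is a hypothetical helper, and the check only confirms that the global NumPy state is left alone. It does not explain why fit_transform still varies.

```python
import numpy as np

def local_permutation(n, seed):
    # Hypothetical helper: a private generator, so drawing from it
    # does not advance np.random's global state.
    rng = np.random.RandomState(seed)
    return rng.permutation(n)

np.random.seed(0)
state_before = np.random.get_state()
perm_a = local_permutation(10, seed=42)
perm_b = local_permutation(10, seed=42)
state_after = np.random.get_state()

# Same seed gives the same permutation, and the global state is unchanged.
same_permutation = np.array_equal(perm_a, perm_b)
global_state_untouched = (
    np.array_equal(state_before[1], state_after[1])
    and state_before[2] == state_after[2]
)
```

If results still differ with the random state under control, the remaining variation has to come from somewhere other than these permutations.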
Thank you for the great product!
My question is: how can I speed up the .transform() function? My input is a DataFrame with only ONE row, but transforming the original features into the new ones takes too much time, and I don't understand why. Shouldn't it only need to apply the stored formulas to the original features of the new dataset?
Maybe you know how to speed up this process?
I am facing the following issue when running AutoFeatRegressor.fit_transform(featuresDf, targetFeature):
Already checked if there are any infinity values or nan. Also, converted everything to float32. Any pointers? Thanks!
Update: tried the same set of inputs with FeatureSelector and everything is working great.
Update 2: posted this question on StackOverflow
c:\dox\rnd\ml-pipeline-notebooks\modules\autoFeat.py in augmentFeatures(features, targetFeature, verbose)
41 print(featuresDf.head())
42 print(targetFeature)
---> 43 newFeatureDf = autoFeatRegressor.fit_transform(featuresDf, targetFeature)
44
45 if verbose:
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\utils\validation.py in _assert_all_finite(X, allow_nan, msg_dtype)
112 ):
113 type_err = "infinity" if allow_nan else "NaN, infinity"
--> 114 raise ValueError(
115 msg_err.format(
116 type_err, msg_dtype if msg_dtype is not None else X.dtype
ValueError: Input contains NaN, infinity or a value too large for dtype('float32').
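One thing worth checking for this error: a value can be finite in float64 and still be too large for float32, so a plain NaN/infinity check on the input may pass. The sketch below is a hypothetical helper (`report_non_finite` is not part of autofeat or scikit-learn) that counts all three cases per column. It only inspects the columns it is given, so problems created later by generated features would not show up here.

```python
import numpy as np
import pandas as pd

def report_non_finite(df):
    # Hypothetical helper: count NaN, infinite, and float32-overflowing
    # values per column, and return only the columns with a problem.
    values = df.to_numpy(dtype=np.float64)
    f32_max = float(np.finfo(np.float32).max)
    overflow = np.isfinite(values) & (np.abs(values) > f32_max)
    report = pd.DataFrame(
        {
            "n_nan": np.isnan(values).sum(axis=0),
            "n_inf": np.isinf(values).sum(axis=0),
            "n_overflow_f32": overflow.sum(axis=0),
        },
        index=df.columns,
    )
    return report[report.sum(axis=1) > 0]

demo = pd.DataFrame(
    {
        "ok": [1.0, 2.0],
        "has_nan": [np.nan, 1.0],
        "has_inf": [np.inf, 1.0],
        "too_big": [1e39, 1.0],  # finite in float64, overflows float32
    }
)
problems = report_non_finite(demo)
```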
#28 introduced a nice speed improvement by using numpy instead of pandas for correlation.
There seems to be a small bug on this line though:
https://github.com/cod3licious/autofeat/blob/master/autofeat/autofeatlight.py#L35
By not specifying an index, the correlation matrix might have feature names as column headers but numeric row indices. This in turn throws off the elimination of correlated columns.
Adding index=df.columns recovers the old behavior:
from pandas.testing import assert_frame_equal
corrmat_old = df.corr().abs()
corrmat = pd.DataFrame(
np.abs(np.corrcoef(df.values, rowvar=False)), columns=df.columns, index=df.columns
)
assert_frame_equal(corrmat_old, corrmat)
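A self-contained version of the comparison above, using a small random frame as stand-in data (the column names `a`, `b`, `c` are made up for the illustration). It shows the numeric row labels that appear when `index` is omitted.

```python
import numpy as np
import pandas as pd
from pandas.testing import assert_frame_equal

rng = np.random.RandomState(0)
df = pd.DataFrame(rng.rand(50, 3), columns=["a", "b", "c"])

corr = np.abs(np.corrcoef(df.to_numpy(), rowvar=False))
without_index = pd.DataFrame(corr, columns=df.columns)                 # rows labelled 0, 1, 2
with_index = pd.DataFrame(corr, columns=df.columns, index=df.columns)  # rows labelled a, b, c

# Only the version with an explicit index matches the pandas result.
assert_frame_equal(df.corr().abs(), with_index)
```

Any later lookup by feature name along the rows (for example `corrmat.loc["a"]`) only works on the second frame.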
Known issue related to sklearn bug: scikit-learn/scikit-learn#9603
Sometimes (randomly) during the feature selection process, the LassoLARS model crashes with an error like this
.../lib/python3.6/site-packages/sklearn/linear_model/least_angle.py in lars_path(X, y, Xy, Gram, max_iter, alpha_min, method, copy_X, eps, copy_Gram, verbose, return_path, return_n_iter, positive)
378 least_squares)
379
--> 380 g1 = arrayfuncs.min_pos((C - Cov) / (AA - corr_eq_dir + tiny32))
381 if positive:
382 gamma_ = min(g1, C / AA)
ValueError: operands could not be broadcast together with shapes (397,) (396,)
Try updating your sklearn and numpy versions; otherwise I'm afraid there is not much that could be done here.
Great package!
Have you looked into using joblib on this loop,
for i, (feat1, feat2) in enumerate(feature_tuples)
in feateng.py?
Seems like it could be easily parallelized. You have several global variables inside the loop, so I haven't had a chance to deconstruct this into a function that could be called via joblib. I thought it might be more straightforward for you as you are more familiar with the code.
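To make the refactoring idea concrete, here is a rough sketch of the general pattern: move the loop body into a function that receives everything it needs as arguments, then dispatch it with joblib. It is not the loop from feateng.py. `combine_pair` and `combine_all_pairs` are hypothetical names, and the product of two columns stands in for whatever the real loop computes.

```python
from itertools import combinations

import numpy as np
from joblib import Parallel, delayed

def combine_pair(i, j, X):
    # Pure function: no globals, all inputs are arguments.
    return (i, j), X[:, i] * X[:, j]

def combine_all_pairs(X, n_jobs=2):
    pairs = combinations(range(X.shape[1]), 2)
    # prefer="threads" avoids copying X to worker processes; NumPy
    # releases the GIL for most array operations.
    results = Parallel(n_jobs=n_jobs, prefer="threads")(
        delayed(combine_pair)(i, j, X) for i, j in pairs
    )
    return dict(results)

X = np.arange(12.0).reshape(4, 3)
combined = combine_all_pairs(X)
```

Whether this pays off depends on how much work each pair involves; for very cheap per-pair work the dispatch overhead can outweigh the gain.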
I was thinking of using cython to run autofeat in C and changing some of the variables to static. Do you think this would make autofeat significantly faster?
Hi,
first of all, thanks for the super library.
I would like to ask: how can I also get sin(x) or cos(x) as features?
For example, I have a target function: (x-np.sqrt(2))*(np.sin(8*np.pi*x))**2
BR
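One library-independent workaround is to add the trigonometric columns yourself before handing the data to any feature-engineering step. The sketch below does that with plain NumPy and pandas. `add_trig_columns` is a hypothetical helper, and the frequency 8π is taken from the target function in the question. Tracebacks elsewhere on this page show a `transformations` setting being passed into the feature-engineering step; whether it accepts sine and cosine in a given version should be checked against the library's documentation.

```python
import numpy as np
import pandas as pd

def add_trig_columns(df, column, frequency):
    # Hypothetical helper: append sin and cos of frequency * column.
    out = df.copy()
    out[f"sin_{column}"] = np.sin(frequency * out[column])
    out[f"cos_{column}"] = np.cos(frequency * out[column])
    return out

x = np.linspace(0.0, 1.0, 5)
features = add_trig_columns(pd.DataFrame({"x": x}), "x", frequency=8 * np.pi)

# The target from the question is then a product of existing columns:
# (x - sqrt(2)) * sin_x ** 2
target = (x - np.sqrt(2)) * np.sin(8 * np.pi * x) ** 2
```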
I have a question:
Can the features created by autofeat be just as effective when used with another ML algorithm or in a classification problem?
Any idea how this error can be fixed? I have an i7 8700K clocked at 4.9 GHz and 32 GB of RAM (2x16).
Thank you for the awesome library.
Just have a question about feature scaling, I've read that features need to be scaled BEFORE using autofeat (https://analyticsindiamag.com/guide-to-automatic-feature-engineering-using-autofeat/).
Is this true? I can't seem to find information in the repository that supports this statement.
When running autofeat, it seems to perform scaling during feature selection, so I am wondering whether I have to scale beforehand or whether I can just throw my data into autofeat and have it taken care of.
I already pass 2 arguments: AutoFeatLight.fit_transform(X, y).
I tried passing a DataFrame, then a 2D array. Then I tried giving X a single feature, and then making both X and y one-dimensional arrays.
Nothing works: all of them give me the same error.
Hello,
I noticed that results are not reproducible when using the library, i.e. the sklearn drop-in replacement classes produce slightly different results each time.
For example, when using:
features_engineer = AutoFeatClassifier()
features_engineer.fit_transform(data_train.data, data_train.target.value)
it calculates (or selects) different features each time.
The issue above I temporarily fixed by using:
random.seed(seed)
np.random.seed(seed)
so that the outputs produced by AutoFeatClassifier stay constant across runs.
However, when I tried using the following:
selector = FeatureSelector(verbose=self.verbose, problem_type="classification", featsel_runs=5)
selector.fit_transform(df_indices, target)
the seed-setting trick above did not have the desired effect: the selected features still change between runs.
Is there an easy way to fix this? Randomness must be introduced somewhere in the source.
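One possible source of the remaining variation (an assumption, not verified against the autofeat source) is that runs dispatched to separate worker processes do not see the seed set on the global generator in the parent process. A general way to make repeated runs deterministic regardless of global state is to derive one child seed per run from a master seed. The sketch below shows that pattern with NumPy only. `run_selection` is a hypothetical stand-in for one feature-selection run.

```python
import numpy as np

def run_selection(child_seed, n_rows=20, n_keep=5):
    # Hypothetical stand-in for one run: each run owns its generator.
    rng = np.random.default_rng(child_seed)
    return rng.permutation(n_rows)[:n_keep]

def seeded_runs(master_seed, n_runs):
    # Spawn one independent child seed per run from the master seed.
    children = np.random.SeedSequence(master_seed).spawn(n_runs)
    return [run_selection(child) for child in children]

first = seeded_runs(master_seed=123, n_runs=5)
np.random.seed(999)  # changing the global state has no effect on the runs
second = seeded_runs(master_seed=123, n_runs=5)
reproducible = all(np.array_equal(a, b) for a, b in zip(first, second))
```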
I got this error on windows 10:
df = model.fit_transform(X, y)
[AutoFeatRegression] The 3 step feature engineering process could generate up to 2864745 features.
[AutoFeatRegression] With 48573 data points this new feature matrix would use about 556.60 gb of space.
Step 1: transformation of original features
0/ 83Traceback (most recent call last):
File "D:\Anaconda3\envs\quant\lib\site-packages\sympy\utilities\autowrap.py", line 168, in _process_files
retoutput = check_output(command, stderr=STDOUT)
File "D:\Anaconda3\envs\quant\lib\subprocess.py", line 356, in check_output
**kwargs).stdout
File "D:\Anaconda3\envs\quant\lib\subprocess.py", line 438, in run
output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['D:\\Anaconda3\\envs\\quant\\python.exe', 'setup.py', 'build_ext', '--inplace']' returned non-zero exit status 1.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "D:\Anaconda3\envs\quant\lib\site-packages\IPython\core\interactiveshell.py", line 3291, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-19-818b183e7273>", line 1, in <module>
df = model.fit_transform(X, y)
File "D:\Anaconda3\envs\quant\lib\site-packages\autofeat\autofeat.py", line 249, in fit_transform
self.feateng_steps, self.transformations)
File "D:\Anaconda3\envs\quant\lib\site-packages\autofeat\feateng.py", line 249, in generate_features
original_features.extend(apply_tranformations(original_features))
File "D:\Anaconda3\envs\quant\lib\site-packages\autofeat\feateng.py", line 194, in apply_tranformations
f = ufuncify(t, expr_temp)
File "D:\Anaconda3\envs\quant\lib\site-packages\sympy\core\cache.py", line 94, in wrapper
retval = cfunc(*args, **kwargs)
File "D:\Anaconda3\envs\quant\lib\site-packages\sympy\utilities\autowrap.py", line 1105, in ufuncify
return code_wrapper.wrap_code(routines, helpers=helps)
File "D:\Anaconda3\envs\quant\lib\site-packages\sympy\utilities\autowrap.py", line 828, in wrap_code
self._process_files(routines)
File "D:\Anaconda3\envs\quant\lib\site-packages\sympy\utilities\autowrap.py", line 172, in _process_files
" ".join(command), e.output.decode('utf-8')))
sympy.utilities.autowrap.CodeWrapError: Error while executing command: D:\Anaconda3\envs\quant\python.exe setup.py build_ext --inplace. Command output is:
running build_ext
running build_src
build_src
building extension "wrapper_module_1" sources
build_src: building npy-pkg config files
No module named 'numpy.distutils._msvccompiler' in numpy.distutils; trying from distutils
customize MSVCCompiler
customize MSVCCompiler using build_ext
building 'wrapper_module_1' extension
compiling C sources
error: Unable to find vcvarsall.bat
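The last line of the output, "Unable to find vcvarsall.bat", indicates that sympy's `ufuncify` tried to compile a C extension and could not find a Microsoft C++ compiler. Installing the Microsoft C++ build tools is one route. For comparison, `sympy.lambdify` turns an expression into a NumPy function without any compiler, as sketched below. The expression `1 / x` is only a stand-in for one of the generated transformations. Another traceback on this page mentions `_lambdifygenerated`, which suggests later autofeat versions use lambdify, but that is an inference.

```python
import numpy as np
import sympy

x = sympy.Symbol("x")
expr = 1 / x  # stand-in for a generated feature formula

# lambdify generates plain Python/NumPy code, so no C compiler is needed.
f = sympy.lambdify(x, expr, modules="numpy")
values = f(np.array([1.0, 2.0, 4.0]))
```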
Does this library have a method to transform a new dataset into one that includes the additional features generated before?
So I need to perform these steps:
Is this possible now?
When running on a medium dataset (~50K instances, ~35 features), I noticed that autofeat's AutoFeatLight was taking hours and hours to complete. After debugging a bit, the bottleneck was the line:
autofeat/autofeat/autofeatlight.py
Line 35 in d08dd47
When I replaced it with Numpy, the code ran orders of magnitude faster:
corrmat = pd.DataFrame(np.abs(np.corrcoef(df.values, rowvar=False)), columns=df.columns)
name: Documentation Enhancement
about: Request for clarification or improvement in the documentation
Description of the Issue
I've been using the autofeat library and encountered difficulties in obtaining the final model, model features, and coefficients as described in the library's documentation. The model.get_model(), model.get_model_features(), and model.get_model_coefs() methods do not seem to work as expected.
Additional Information
autofeat Library Version: [e.g., autofeat 2.1.2]
Note to Maintainers
Please consider updating the documentation or providing guidance on how to access the final model, features, and coefficients using the AutoFeatRegressor class. This would greatly help users in understanding and utilizing the library effectively.
Thank you for your attention to this matter.
Minor point, but it would be great if I could create my own dict of ureg objects and pass them instead of a dict of strings that need to be parsed.
I know this is only for regression, but the approach I am thinking of is to fit_transform the entire data frame containing the multi-label 'y' column and feed the resulting data frame to some other model. Will it work, or is there a better approach? (Please provide a code example if possible.)
Thanks.
afreg = AutoFeatRegressor(verbose=1, feateng_steps=3, n_jobs=24)
afreg.fit(x_train, y_train)
ValueError Traceback (most recent call last)
Input In [15], in <cell line: 2>()
1 afreg = AutoFeatRegressor(verbose=1, feateng_steps=1, n_jobs=24)
----> 2 afreg.fit(x_train, y_train)
File /app/conda/b2b-latest/lib/python3.8/site-packages/autofeat/autofeat.py:422, in AutoFeatModel.fit(self, X, y)
420 print("[AutoFeat] Warning: This just calls fit_transform() but does not return the transformed dataframe.")
421 print("[AutoFeat] It is much more efficient to call fit_transform() instead of fit() and transform()!")
--> 422 _ = self.fit_transform(X, y) # noqa
423 return self
File /app/conda/b2b-latest/lib/python3.8/site-packages/autofeat/autofeat.py:353, in AutoFeatModel.fit_transform(self, X, y)
351 else:
352 if self.problem_type in ("regression", "classification"):
--> 353 good_cols = select_features(
354 df_subs, target_sub, self.featsel_runs, None, self.problem_type, self.n_jobs, self.verbose
355 )
356 # if no features were selected, take the original features
357 if not good_cols:
File /app/conda/b2b-latest/lib/python3.8/site-packages/autofeat/featsel.py:245, in select_features(df, target, featsel_runs, keep, problem_type, n_jobs, verbose)
241 def flatten_lists(l: list):
242 return [item for sublist in l for item in sublist]
244 selected_columns = flatten_lists(
--> 245 Parallel(n_jobs=n_jobs, verbose=100 * verbose)(delayed(run_select_features)(i) for i in range(featsel_runs))
246 )
248 if selected_columns:
249 selected_columns = Counter(selected_columns)
File /app/conda/b2b-latest/lib/python3.8/site-packages/joblib/parallel.py:1944, in Parallel.__call__(self, iterable)
1938 # The first item from the output is blank, but it makes the interpreter
1939 # progress until it enters the Try/Except block of the generator and
1940 # reach the first `yield` statement. This starts the aynchronous
1941 # dispatch of the tasks to the workers.
1942 next(output)
-> 1944 return output if self.return_generator else list(output)
File /app/conda/b2b-latest/lib/python3.8/site-packages/joblib/parallel.py:1587, in Parallel._get_outputs(self, iterator, pre_dispatch)
1584 yield
1586 with self._backend.retrieval_context():
-> 1587 yield from self._retrieve()
1589 except GeneratorExit:
1590 # The generator has been garbage collected before being fully
1591 # consumed. This aborts the remaining tasks if possible and warn
1592 # the user if necessary.
1593 self._exception = True
File /app/conda/b2b-latest/lib/python3.8/site-packages/joblib/parallel.py:1691, in Parallel._retrieve(self)
1684 while self._wait_retrieval():
1685
1686 # If the callback thread of a worker has signaled that its task
1687 # triggered an exception, or if the retrieval loop has raised an
1688 # exception (e.g. `GeneratorExit`), exit the loop and surface the
1689 # worker traceback.
1690 if self._aborting:
-> 1691 self._raise_error_fast()
1692 break
1694 # If the next job is not ready for retrieval yet, we just wait for
1695 # async callbacks to progress.
File /app/conda/b2b-latest/lib/python3.8/site-packages/joblib/parallel.py:1726, in Parallel._raise_error_fast(self)
1722 # If this error job exists, immediatly raise the error by
1723 # calling get_result. This job might not exists if abort has been
1724 # called directly or if the generator is gc'ed.
1725 if error_job is not None:
-> 1726 error_job.get_result(self.timeout)
File /app/conda/b2b-latest/lib/python3.8/site-packages/joblib/parallel.py:735, in BatchCompletionCallBack.get_result(self, timeout)
729 backend = self.parallel._backend
731 if backend.supports_retrieve_callback:
732 # We assume that the result has already been retrieved by the
733 # callback thread, and is stored internally. It's just waiting to
734 # be returned.
--> 735 return self._return_or_raise()
737 # For other backends, the main thread needs to run the retrieval step.
738 try:
File /app/conda/b2b-latest/lib/python3.8/site-packages/joblib/parallel.py:753, in BatchCompletionCallBack._return_or_raise(self)
751 try:
752 if self.status == TASK_ERROR:
--> 753 raise self._result
754 return self._result
755 finally:
ValueError: Input X contains NaN.
LassoLarsCV does not accept missing values encoded as NaN natively.
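The message says the data reaching LassoLarsCV contains NaN, so the missing values have to be handled before fitting. A minimal sketch of one option, median imputation with pandas, follows. `impute_with_median` is a hypothetical helper, and whether imputing or dropping rows is more appropriate depends on the data.

```python
import numpy as np
import pandas as pd

def impute_with_median(df):
    # Hypothetical helper: report missing counts, then fill each
    # column's NaNs with that column's median.
    n_missing = df.isna().sum()
    filled = df.fillna(df.median(numeric_only=True))
    return filled, n_missing[n_missing > 0]

demo = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [1.0, 2.0, 3.0]})
filled, missing = impute_with_median(demo)
```

Here `missing` lists only the columns that contained NaN, which helps to see where the problem came from before it is papered over.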
Will it work for logistic regression, meaning classification tasks?
Hi, Drs,
X_train and y_train are DataFrames of float64 arrays when I use the following code:
alldata1= sio.loadmat(path)['input_train'].ravel()
x_train=pd.DataFrame(alldata1[folds])
x_train.columns=input_new_columns
alldata2= sio.loadmat(path)['output_train'].ravel()
y_train=pd.DataFrame(alldata2[folds])
y_train.columns=['targat']
alldata3= sio.loadmat(path)['input_test'].ravel()
x_test=pd.DataFrame(alldata3[folds])
x_test.columns=input_new_columns
alldata4= sio.loadmat(path)['output_test'].ravel()
y_test=pd.DataFrame(alldata4[folds])
y_test.columns=['targat']
alldata5= sio.loadmat(path)['input_vaildation'].ravel()
x_validation=pd.DataFrame(alldata5[folds])
x_validation.columns=input_new_columns
alldata6= sio.loadmat(path)['output_vaildation'].ravel()
y_validation=pd.DataFrame(alldata6[folds])
y_validation.columns=['targat']
afreg = AutoFeatClassifier(feateng_steps=2,categorical_cols=sparse_new_columns)
x_train_tr = afreg.fit_transform(x_train, y_train)
but I get an error:
UFuncTypeError: ufunc '_lambdifygenerated' did not contain a loop with signature matching types (<class 'numpy.dtype[float32]'>, <class 'numpy.dtype[float32]'>) -> None
May I ask what causes this?
Hi,
not sure if this is a bug, but I thought the behavior was quite strange. I created some toy data, a set of independent variables x_i in [0, 1) and as target I compute y=x_0^2. The other x_i are noise variables.
Autofeat doesn't find the correct functional form in this case; instead it finds something like a*x_0 + b*1/x_0, for suitable values of a, b.
However, when I increase the range of x_0 from [0, 1) to [-1, 1), autofeat easily finds the correct functional form.
Here's my code for data generation:
import numpy as np
from sklearn.model_selection import train_test_split

def generate_toydata(n_samples, fun, x_min, x_max, noise_level=0.1):
    x_range = x_max - x_min
    # main feature, uniform in [x_min, x_max)
    x_main = (np.random.rand(n_samples) * x_range) + x_min
    # generate 4 unrelated (noise) features
    x_unrelated = np.random.rand(n_samples, 4)
    # target: function of the main feature plus Gaussian noise
    y = fun(x_main) + np.random.normal(0, noise_level, n_samples)
    X = np.column_stack((x_main, x_unrelated))
    return X, y

X, y = generate_toydata(
    n_samples=100,
    fun=np.square,
    x_min=0.0,
    x_max=1.0,
    noise_level=0.01,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False,
)
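A possible explanation, offered as a hypothesis rather than a confirmed diagnosis: a traceback elsewhere on this page shows the feature-engineering step keeping only generated columns whose absolute correlation with the original features is below 0.9 (`corrs[c] < 0.9`). For x uniform on [0, 1), the correlation between x and x^2 is sqrt(15/16), about 0.968, so x_0^2 would be discarded as redundant with x_0. On [-1, 1) the correlation is about 0 by symmetry, so the squared feature survives. The sketch below checks both values numerically.

```python
import numpy as np

rng = np.random.RandomState(0)
x_pos = rng.uniform(0.0, 1.0, 100_000)   # range from the failing case
x_sym = rng.uniform(-1.0, 1.0, 100_000)  # range from the working case

# Absolute Pearson correlation between a feature and its square.
corr_pos = abs(np.corrcoef(x_pos, x_pos ** 2)[0, 1])
corr_sym = abs(np.corrcoef(x_sym, x_sym ** 2)[0, 1])

# Closed-form value for the uniform [0, 1) case.
corr_pos_exact = np.sqrt(15.0 / 16.0)
```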
ImportError: cannot import name 'AutoFeatRegressor' from 'autofeat'
Autofeat provides only predict for classification problems; it should also provide predict_proba for use cases where we're not interested in a 'hard prediction' but in a 'ranking'. I think it shouldn't be hard to expose it from the underlying sklearn models.
Getting this error while using the benchmark of autofeat.AutoFeatClassifier
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[1], line 8
6 import numpy as np
7 import pandas as pd
----> 8 from autofeat import AutoFeatClassifier
9 from sklearn.datasets import load_breast_cancer, load_iris, load_wine
10 from sklearn.ensemble import RandomForestClassifier
File ~/my-conda-envs/fraud/lib/python3.8/site-packages/autofeat/__init__.py:6
4 name = "autofeat"
5 __version__ = "2.1.0"
----> 6 from .autofeatlight import AutoFeatLight # noqa
7 from .autofeat import AutoFeatModel, AutoFeatRegressor, AutoFeatClassifier # noqa
8 from .featsel import FeatureSelector # noqa
File ~/my-conda-envs/fraud/lib/python3.8/site-packages/autofeat/autofeatlight.py:56
51 print(sorted(useless_cols))
52 return [c for c in df.columns if c not in useless_cols]
55 def _compute_additional_features(
---> 56 X: np.ndarray, feature_names: list | None = None, compute_ratio: bool = True, compute_product: bool = True, verbose: int = 0
57 ) -> Tuple[np.ndarray, list]:
58 """
59 Compute additional non-linear features from the original features (ratio or product of two features).
60
(...)
69 - list with n_additional_features names describing the newly computed features
70 """
71 # check how many new features we will compute
TypeError: unsupported operand type(s) for |: 'type' and 'NoneType'
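The traceback points at the annotation `list | None`, and the paths show Python 3.8. The `X | Y` syntax for types was added in Python 3.10 (PEP 604), so on older interpreters the annotation fails as soon as the function is defined. Two common remedies are `typing.Optional` or `from __future__ import annotations` at the top of the module; upgrading Python also avoids the problem. The sketch below shows the `Optional` form. `name_features` is a hypothetical function, not the one from autofeatlight.py.

```python
from typing import List, Optional, Tuple

def name_features(
    n_features: int, feature_names: Optional[List[str]] = None
) -> Tuple[int, List[str]]:
    # Optional[List[str]] means "a list of str or None" and is valid
    # on Python versions before 3.10, unlike `list | None`.
    if feature_names is None:
        feature_names = [f"x{i:03d}" for i in range(n_features)]
    return len(feature_names), feature_names

default_names = name_features(2)
custom_names = name_features(1, ["a"])
```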
After updating your library, my public notebook "Titanic - Autofeat (automatic FE)", which selects features for the Kaggle competition "Titanic - Machine Learning from Disaster", stopped working.
Earlier (a year ago) this code worked successfully (autofeat-0.2.5):
model = AutoFeatRegression()
X_train_feature_creation = model.fit_transform(train.to_numpy(), target.to_numpy().flatten())
X_test_feature_creation = model.transform(test.to_numpy())
Now it outputs an error (autofeat-2.0.4):
model = AutoFeatRegressor()
X_train_feature_creation = model.fit_transform(train.to_numpy(), target.to_numpy().flatten())
X_test_feature_creation = model.transform(test.to_numpy())
I've tried this as well:
X_train_feature_creation = model.fit_transform(train, target)
X_test_feature_creation = model.transform(test)
and:
model.fit(train, target)
I got the same error.
Thanks in advance for the advice on how to fix this.
i.e. does autofeat work on vaex or dask?
Hi!
While trying to use the AutoFeatClassifier using units, I stumbled upon a validation error caused by an infinite value.
Presumably one of the generated features (I assume from the ones coming from the Pi theorem) has an infinite value, which breaks the StandardScaler used during the filtering of correlated features.
This is how I am calling the classifier, with fitting the training data that comes in a numpy ndarray
auto = AutoFeatClassifier(categorical_cols=categorical_cols, units=units, verbose=1, feateng_steps=3, featsel_runs=5, n_jobs=5, apply_pi_theorem=True)
X_train_new = auto.fit_transform(X_train_sampled, y_train_sampled)
These are the features logged for the Pi Theorem, and all of them include divisions (which could lead to a division by 0 issue).
...
[AutoFeat] Applying the Pi Theorem
[AutoFeat] Pi Theorem 1: x002 / x001
[AutoFeat] Pi Theorem 2: x006 / x000
[AutoFeat] Pi Theorem 3: x010 / x005
[AutoFeat] Pi Theorem 4: x003 / x001
[AutoFeat] Pi Theorem 5: x013 / x001
[AutoFeat] Pi Theorem 6: x014 / x005
[AutoFeat] Pi Theorem 7: x000 * x005 * x012 / x015
[AutoFeat] Pi Theorem 8: x016 / x000
[AutoFeat] Pi Theorem 9: x017 / x012
...
The full log output of a failing run is the following:
[AutoFeat] Applying the Pi Theorem
[AutoFeat] Pi Theorem 1: x002 / x001
[AutoFeat] Pi Theorem 2: x006 / x000
[AutoFeat] Pi Theorem 3: x007 / x005
[AutoFeat] Pi Theorem 4: x003 / x001
[AutoFeat] Pi Theorem 5: x009 / x001
[AutoFeat] Pi Theorem 6: x010 / x005
[AutoFeat] Pi Theorem 7: x000 * x005 * x008 / x011
[AutoFeat] Pi Theorem 8: x012 / x000
[AutoFeat] Pi Theorem 9: x013 / x008
[AutoFeat] The 3 step feature engineering process could generate up to 118923 features.
[AutoFeat] With 121 data points this new feature matrix would use about 0.06 gb of space.
[feateng] Step 1: transformation of original features
[feateng] Generated 40 transformed features from 14 original features - done.
[feateng] Step 2: first combination of features
[feateng] Generated 1524 feature combinations from 1431 original feature tuples - done.
[feateng] Step 3: transformation of new features
[feateng] Generated 4564 transformed features from 1524 original features - done.
[feateng] Generated altogether 6233 new features in 3 steps
[feateng] Removing correlated features, as well as additions at the highest level
And after that, the error is reported with the following stack trace:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-323-53dcdfc1b68e> in <module>
32 # categorical_cols = []
33 auto = AutoFeatClassifier(categorical_cols=categorical_cols, units=units, verbose=1, feateng_steps=3, featsel_runs=5, n_jobs=5, apply_pi_theorem=True)
---> 34 X_train_new = auto.fit_transform(X_train_sampled, y_train_sampled)
35 X_test_new = auto.transform(X_test.to_numpy())
36 pretty_names = feature_names(auto, USEFUL_ACTUALS)
~/.pyenv/versions/features/lib/python3.7/site-packages/autofeat/autofeat.py in fit_transform(self, X, y)
299 # generate features
300 df_subs, self.feature_formulas_ = engineer_features(df_subs, self.feateng_cols_, _parse_units(self.units, verbose=self.verbose),
--> 301 self.feateng_steps, self.transformations, self.verbose)
302 # select predictive features
303 if self.featsel_runs <= 0:
~/.pyenv/versions/features/lib/python3.7/site-packages/autofeat/feateng.py in engineer_features(df_org, start_features, units, max_steps, transformations, verbose)
354 if cols:
355 # check for correlated features again; this time with the start features
--> 356 corrs = dict(zip(cols, np.max(np.abs(np.dot(StandardScaler().fit_transform(df[cols]).T, StandardScaler().fit_transform(df_org))/df_org.shape[0]), axis=1)))
357 cols = [c for c in cols if corrs[c] < 0.9]
358 cols = list(df_org.columns) + cols
~/.pyenv/versions/features/lib/python3.7/site-packages/sklearn/base.py in fit_transform(self, X, y, **fit_params)
688 if y is None:
689 # fit method of arity 1 (unsupervised transformation)
--> 690 return self.fit(X, **fit_params).transform(X)
691 else:
692 # fit method of arity 2 (supervised transformation)
~/.pyenv/versions/features/lib/python3.7/site-packages/sklearn/preprocessing/_data.py in fit(self, X, y)
665 # Reset internal state before fitting
666 self._reset()
--> 667 return self.partial_fit(X, y)
668
669 def partial_fit(self, X, y=None):
~/.pyenv/versions/features/lib/python3.7/site-packages/sklearn/preprocessing/_data.py in partial_fit(self, X, y)
696 X = self._validate_data(X, accept_sparse=('csr', 'csc'),
697 estimator=self, dtype=FLOAT_DTYPES,
--> 698 force_all_finite='allow-nan')
699
700 # Even in the case of `with_mean=False`, we update the mean anyway
~/.pyenv/versions/features/lib/python3.7/site-packages/sklearn/base.py in _validate_data(self, X, y, reset, validate_separately, **check_params)
418 f"requires y to be passed, but the target y is None."
419 )
--> 420 X = check_array(X, **check_params)
421 out = X
422 else:
~/.pyenv/versions/features/lib/python3.7/site-packages/sklearn/utils/validation.py in inner_f(*args, **kwargs)
70 FutureWarning)
71 kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 72 return f(**kwargs)
73 return inner_f
74
~/.pyenv/versions/features/lib/python3.7/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
643 if force_all_finite:
644 _assert_all_finite(array,
--> 645 allow_nan=force_all_finite == 'allow-nan')
646
647 if ensure_min_samples > 0:
~/.pyenv/versions/features/lib/python3.7/site-packages/sklearn/utils/validation.py in _assert_all_finite(X, allow_nan, msg_dtype)
97 msg_err.format
98 (type_err,
---> 99 msg_dtype if msg_dtype is not None else X.dtype)
100 )
101 # for object dtype data, we only check for NaNs (GH-13254)
ValueError: Input contains infinity or a value too large for dtype('float64').
I tried removing all the constant features from the original dataset, so that all the original features have std() > 0.
It looks like one of the generated features contains a division by zero, which produces an infinite value somewhere deep in the generated features.
Maybe there should be some handling there, ignoring the feature or replacing the infinites with NaN which the scalers know to ignore?
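The handling suggested above (turn infinities into NaN, then ignore the affected features) can be sketched with pandas. This is an illustration of the idea applied to a data frame outside the library, not a patch to autofeat. `drop_non_finite_columns` is a hypothetical helper, and the column names mirror the Pi Theorem log lines.

```python
import numpy as np
import pandas as pd

def drop_non_finite_columns(df):
    # Hypothetical helper: replace +/-inf with NaN, then drop every
    # column that contains a NaN afterwards.
    cleaned = df.replace([np.inf, -np.inf], np.nan)
    bad_columns = list(cleaned.columns[cleaned.isna().any()])
    return cleaned.drop(columns=bad_columns), bad_columns

demo = pd.DataFrame({"x001": [1.0, 0.0, 2.0], "x002": [3.0, 4.0, 5.0]})
demo["x002 / x001"] = demo["x002"] / demo["x001"]  # 4.0 / 0.0 gives inf

cleaned, dropped = drop_non_finite_columns(demo)
```

Dropping the whole column is the blunt option; replacing zeros in the denominator columns before fitting is an alternative if those ratios are expected to matter.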
Can you please clarify which metric is used for classification?
Is it possible to use the F1 metric for optimization, since real-life data is imbalanced?
Hi, first of all, great work! You have made something far better than featuretools, manual feature engineering, etc., so congratulations!
I have two questions. First, are the classifier and regressor ensemble methods, and can I learn about the maths behind them? Second, is the final score shown at the end of training a training score or a cross-validation score?
Can you share a simple example that demonstrates the overfitting mentioned here?
Please keep in mind that since the AutoFeatRegressor and AutoFeatClassifier models can generate very complex features, it will likely overfit on noise in the dataset, though the coefficients for features related to noise should be fairly small. It is generally suggested to carefully inspect the features found by autofeat and use those that make sense to you to train your own models
I ran autofeat on a set of fully numeric features with no nulls
using
afreg = AutoFeatClassifier(verbose=1, feateng_steps=2, max_gb=10, n_jobs=12)
ft_train_tr = afreg.fit_transform(ft_train, target)
the verbose logging showed:
[AutoFeat] The 2 step feature engineering process could generate up to 980700 features.
[AutoFeat] With 200000 data points this new feature matrix would use about 784.56 gb of space.
[AutoFeat] As you specified a limit of 10 gb, the number of data points is subsampled to 2549
[feateng] Step 1: transformation of original features
[feateng] Generated 915 transformed features from 200 original features - done.
[feateng] Step 2: first combination of features
[feateng] Generated 620258 feature combinations from 621055 original feature tuples - done.
[feateng] Generated altogether 621174 new features in 2 steps
[feateng] Removing correlated features, as well as additions at the highest level
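The numbers in these log lines are consistent with a simple estimate: features times data points times 4 bytes, divided by 1e9. The 4 bytes per value (float32) is inferred from the logged figures, not confirmed from the source. The sketch below reproduces the 784.56 gb estimate and the subsample size of 2549, and also the 556.60 gb figure from another log on this page.

```python
def estimate_gb(n_features, n_rows, bytes_per_value=4):
    # Assumed formula: dense matrix size in decimal gigabytes.
    return n_features * n_rows * bytes_per_value / 1e9

def max_rows_for_limit(n_features, max_gb, bytes_per_value=4):
    # Largest number of rows whose feature matrix fits within max_gb.
    return int(max_gb * 1e9 / (n_features * bytes_per_value))

full_size = round(estimate_gb(980700, 200000), 2)       # 784.56
subsample = max_rows_for_limit(980700, max_gb=10)        # 2549
other_log = round(estimate_gb(2864745, 48573), 2)        # 556.6
```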
Then I got this error:
AttributeError Traceback (most recent call last)
<ipython-input-10-5be3e98d3950> in <module>
1 afreg = AutoFeatClassifier(verbose=1, feateng_steps=2, max_gb=10, n_jobs=12)
----> 2 ft_train_tr = afreg.fit_transform(ft_train, target)
/opt/conda/lib/python3.8/site-packages/autofeat/autofeat.py in fit_transform(self, X, y)
298 target_sub = target.copy()
299 # generate features
--> 300 df_subs, self.feature_formulas_ = engineer_features(df_subs, self.feateng_cols_, _parse_units(self.units, verbose=self.verbose),
301 self.feateng_steps, self.transformations, self.verbose)
302 # select predictive features
/opt/conda/lib/python3.8/site-packages/autofeat/feateng.py in engineer_features(df_org, start_features, units, max_steps, transformations, verbose)
343 print("[feateng] Generated altogether %i new features in %i steps" % (len(feature_pool) - len(start_features), max_steps))
344 print("[feateng] Removing correlated features, as well as additions at the highest level")
--> 345 feature_pool = {c: feature_pool[c] for c in feature_pool if c in uncorr_features and not feature_pool[c].func == sympy.add.Add}
346 cols = [c for c in list(df.columns) if c in feature_pool and c not in df_org.columns] # categorical cols not in feature_pool
347 if cols:
/opt/conda/lib/python3.8/site-packages/autofeat/feateng.py in <dictcomp>(.0)
343 print("[feateng] Generated altogether %i new features in %i steps" % (len(feature_pool) - len(start_features), max_steps))
344 print("[feateng] Removing correlated features, as well as additions at the highest level")
--> 345 feature_pool = {c: feature_pool[c] for c in feature_pool if c in uncorr_features and not feature_pool[c].func == sympy.add.Add}
346 cols = [c for c in list(df.columns) if c in feature_pool and c not in df_org.columns] # categorical cols not in feature_pool
347 if cols:
AttributeError: module 'sympy' has no attribute 'add'
The ft_train columns are all of type float16.
The target is a pandas Series with values array([0, 1]).
Any ideas what is wrong?
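The AttributeError is about the name `sympy.add`, not about the data: the installed sympy does not expose an `add` attribute at the top level. The class itself is available as `sympy.Add` (defined in `sympy.core.add`). The check that the failing line performs can be written against that public name, as sketched below. Whether the installed autofeat and sympy versions are compatible is a separate question worth checking.

```python
import sympy

x, y = sympy.symbols("x y")
total = x + y
product = x * y

# sympy.Add is the public name of the addition class.
total_is_add = isinstance(total, sympy.Add)
product_is_add = isinstance(product, sympy.Add)
same_class = total.func is sympy.core.add.Add
```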
Can you clarify whether categorical features can be used with your classifier code?
And do you create combinatorial combinations for categorical features?