sberbank-ai-lab / lightautoml Goto Github PK

LAMA - automatic model creation framework

License: Apache License 2.0

Python 97.51% HTML 2.49%

automl classification regression multiclass pytorch nlp model-selection parameter-tuning automated-machine-learning lama

lightautoml's Introduction

LightAutoML - automatic model creation framework

LightAutoML (LAMA) is an AutoML framework by Sber AI Lab.

It provides automatic model creation for the following tasks:

binary classification
multiclass classification
regression

Current version of the package handles datasets that have independent samples in each row. I.e. each row is an object with its specific features and target. Multitable datasets and sequences are a work in progress :)

Note: we use AutoWoE library to automatically create interpretable models.

Authors: Alexander Ryzhkov, Anton Vakhrushev, Dmitry Simakov, Vasilii Bunakov, Rinchin Damdinov, Pavel Shvets, Alexander Kirilin.

Documentation of LightAutoML is available here, you can also generate it.

(New feature) GPU pipeline

Full GPU pipeline for LightAutoML currently available for developers testing (still in progress). The code and tutorials available here

Installation LightAutoML from PyPI
Quick tour
Resources
Contributing to LightAutoML
License
For developers
Support and feature requests

Installation

To install LAMA framework on your machine from PyPI, execute following commands:

# Install base functionality:

pip install -U lightautoml

# For partial installation use corresponding option.
# Extra dependecies: [nlp, cv, report]
# Or you can use 'all' to install everything

pip install -U lightautoml[nlp]

Additionaly, run following commands to enable pdf report generation:

# MacOS
brew install cairo pango gdk-pixbuf libffi

# Debian / Ubuntu
sudo apt-get install build-essential libcairo2 libpango-1.0-0 libpangocairo-1.0-0 libgdk-pixbuf2.0-0 libffi-dev shared-mime-info

# Fedora
sudo yum install redhat-rpm-config libffi-devel cairo pango gdk-pixbuf2

# Windows
# follow this tutorial https://weasyprint.readthedocs.io/en/stable/install.html#windows

Quick tour

Let's solve the popular Kaggle Titanic competition below. There are two main ways to solve machine learning problems using LightAutoML:

Use ready preset for tabular data:

import pandas as pd
from sklearn.metrics import f1_score

from lightautoml.automl.presets.tabular_presets import TabularAutoML
from lightautoml.tasks import Task

df_train = pd.read_csv('../input/titanic/train.csv')
df_test = pd.read_csv('../input/titanic/test.csv')

automl = TabularAutoML(
    task = Task(
        name = 'binary',
        metric = lambda y_true, y_pred: f1_score(y_true, (y_pred > 0.5)*1))
)
oof_pred = automl.fit_predict(
    df_train,
    roles = {'target': 'Survived', 'drop': ['PassengerId']}
)
test_pred = automl.predict(df_test)

pd.DataFrame({
    'PassengerId':df_test.PassengerId,
    'Survived': (test_pred.data[:, 0] > 0.5)*1
}).to_csv('submit.csv', index = False)

LighAutoML framework has a lot of ready-to-use parts and extensive customization options, to learn more check out the resources section.

Resources

Kaggle kernel examples of LightAutoML usage:

Google Colab tutorials and other examples:

Tutorial_1_basics.ipynb - get started with LightAutoML on tabular data.
Tutorial_2_WhiteBox_AutoWoE.ipynb - creating interpretable models.
Tutorial_3_sql_data_source.ipynb - shows how to use LightAutoML presets (both standalone and time utilized variants) for solving ML tasks on tabular data from SQL data base instead of CSV.
Tutorial_4_NLP_Interpretation.ipynb - example of using TabularNLPAutoML preset, LimeTextExplainer.
Tutorial_5_uplift.ipynb - shows how to use LightAutoML for a uplift-modeling task.
Tutorial_6_custom_pipeline.ipynb - shows how to create your own pipeline from specified blocks: pipelines for feature generation and feature selection, ML algorithms, hyperparameter optimization etc.
Tutorial_7_ICE_and_PDP_interpretation.ipynb - shows how to obtain local and global interpretation of model results using ICE and PDP approaches.

Note 1: for production you have no need to use profiler (which increase work time and memory consomption), so please do not turn it on - it is in off state by default

Note 2: to take a look at this report after the run, please comment last line of demo with report deletion command.

Courses, videos and papers

LightAutoML crash courses:
- (Russian) AutoML course for OpenDataScience community
Video guides:
- (Russian) LightAutoML webinar for Sberloga community (Alexander Ryzhkov, Dmitry Simakov)
- (Russian) LightAutoML hands-on tutorial in Kaggle Kernels (Alexander Ryzhkov)
- (English) Automated Machine Learning with LightAutoML: theory and practice (Alexander Ryzhkov)
- (English) LightAutoML framework general overview, benchmarks and advantages for business (Alexander Ryzhkov)
- (English) LightAutoML practical guide - ML pipeline presets overview (Dmitry Simakov)
Papers:
- Anton Vakhrushev, Alexander Ryzhkov, Dmitry Simakov, Rinchin Damdinov, Maxim Savchenko, Alexander Tuzhilin "LightAutoML: AutoML Solution for a Large Financial Services Ecosystem". arXiv:2109.01528, 2021.
Articles about LightAutoML:
- (English) LightAutoML vs Titanic: 80% accuracy in several lines of code (Medium)
- (English) Hands-On Python Guide to LightAutoML – An Automatic ML Model Creation Framework (Analytic Indian Mag)

Contributing to LightAutoML

If you are interested in contributing to LightAutoML, please read the Contributing Guide to get started.

License

This project is licensed under the Apache License, Version 2.0. See LICENSE file for more details.

For developers

Installation from source code

First of all you need to install git and poetry.

# Load LAMA source code
git clone https://github.com/sberbank-ai-lab/LightAutoML.git

cd LightAutoML/

# !!!Choose only one item!!!

# 1. Global installation: Don't create virtual environment
poetry config virtualenvs.create false --local

# 2. Recommended: Create virtual environment inside your project directory
poetry config virtualenvs.in-project true

# For more information read poetry docs

# Install LAMA
poetry lock
poetry install

Build your own custom pipeline:

import pandas as pd
from sklearn.metrics import f1_score

from lightautoml.automl.presets.tabular_presets import TabularAutoML
from lightautoml.tasks import Task

df_train = pd.read_csv('../input/titanic/train.csv')
df_test = pd.read_csv('../input/titanic/test.csv')

# define that machine learning problem is binary classification
task = Task("binary")

reader = PandasToPandasReader(task, cv=N_FOLDS, random_state=RANDOM_STATE)

# create a feature selector
model0 = BoostLGBM(
    default_params={'learning_rate': 0.05, 'num_leaves': 64,
    'seed': 42, 'num_threads': N_THREADS}
)
pipe0 = LGBSimpleFeatures()
mbie = ModelBasedImportanceEstimator()
selector = ImportanceCutoffSelector(pipe0, model0, mbie, cutoff=0)

# build first level pipeline for AutoML
pipe = LGBSimpleFeatures()
# stop after 20 iterations or after 30 seconds
params_tuner1 = OptunaTuner(n_trials=20, timeout=30)
model1 = BoostLGBM(
    default_params={'learning_rate': 0.05, 'num_leaves': 128,
    'seed': 1, 'num_threads': N_THREADS}
)
model2 = BoostLGBM(
    default_params={'learning_rate': 0.025, 'num_leaves': 64,
    'seed': 2, 'num_threads': N_THREADS}
)
pipeline_lvl1 = MLPipeline([
    (model1, params_tuner1),
    model2
], pre_selection=selector, features_pipeline=pipe, post_selection=None)

# build second level pipeline for AutoML
pipe1 = LGBSimpleFeatures()
model = BoostLGBM(
    default_params={'learning_rate': 0.05, 'num_leaves': 64,
    'max_bin': 1024, 'seed': 3, 'num_threads': N_THREADS},
    freeze_defaults=True
)
pipeline_lvl2 = MLPipeline([model], pre_selection=None, features_pipeline=pipe1,
 post_selection=None)

# build AutoML pipeline
automl = AutoML(reader, [
    [pipeline_lvl1],
    [pipeline_lvl2],
], skip_conn=False)

# train AutoML and get predictions
oof_pred = automl.fit_predict(df_train, roles = {'target': 'Survived', 'drop': ['PassengerId']})
test_pred = automl.predict(df_test)

pd.DataFrame({
    'PassengerId':df_test.PassengerId,
    'Survived': (test_pred.data[:, 0] > 0.5)*1
}).to_csv('submit.csv', index = False)

Support and feature requests

Seek prompt advice at Slack community or Telegram group.

Open bug reports and feature requests on GitHub issues.

lightautoml's People

Contributors

Stargazers

Watchers

Forkers

anhmike jingmouren olgaceban goncaloperes cryptonome viai957 tesemnikov-av denatns alexeytomashyov thilakraj1998 burlakovsber shyoon-seegene getar sts0mrg0 china-challengehub tempcat07 sergeymopozov gridl smokemasterd earlbabson ai-maxim zergey mwaiton zhurik sandy4321 unclemokus vitaly-lv crustaceanj vinayakvaid parusss vivanvatsa jandjane oubush resivalex nonflame ds-agent7 salivonaa riyasql bizzyvinci fdoperezi stabuev tony20202021 andrey-nikitin mikhailkuz irina-64 fenghaolinroix desimakov nikitavoblikov shashanksharad raalab kiotobo greenplace physci nikolaifrolov dev-rinchin ukhta antigab kotikkonstantin salaxieb valeman eli-osherovich kevinresearch kobylkinks hparmar12 burkovba mjvakili sonyeric mldovakin enriczhang o7s8r6 my1stk8s sumitjha406 bindichen ahmedopolis jaywhite995 trx666 ameeransari sainpse ekonyagin magicspirit215 spvasilev vap0rsh4wn rnaimehaom omvishal1 andylao-ai gg-big-org madquirk-hash strogo garotar andromedasmart tdl77

lightautoml's Issues

The kernel appears to have died at the `CatBoost` fitting stage if incorrect `gpu_id` passed

If I have only two available GPUs, I can still pass gpu_id='2' to the TabularAutoML without getting an error.

nvidia-smi output:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.19.01    Driver Version: 465.19.01    CUDA Version: 11.3     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA Tesla V1...  Off  | 00000000:3B:00.0 Off |                    0 |
| N/A   44C    P0    36W / 250W |   1581MiB / 32510MiB |     13%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA Tesla V1...  Off  | 00000000:86:00.0 Off |                    0 |
| N/A   47C    P0    47W / 250W |   1249MiB / 32510MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

Code:

automl = TabularAutoML(
    task=Task('multiclass', ...),
    ...,
    general_params={'use_algos': [['linear_l2', 'cb', 'cb_tuned']]},
    cb_params={'default_params': {'thread_count': 48}},
    gpu_ids='2',
    verbose=3
)
oof_predictions = automl.fit_predict(
    train,
    roles=...
)

The kernel appears to have died at the fit_predict stage when it tries to train the CatBoost model.

Changing gpu_ids='2' to gpu_ids='1' solves the problem.

Please add the corresponding check to the __init__ of the TabularAutoML.

Проблемы с установкой либы

При установке на mac и WSL на питон 3.8 ругается при установке opencv.
На mac на питон 3.6 встала, но ругается при импорте

  Referenced from: /Users/tomashev-aa/lightautoml/lama-venv/lib/python3.6/site-packages/lightgbm/lib_lightgbm.so
  Reason: image not found

На WSL на питон 3.6 встала и работает)

AutoML benchmark - where I can find results

Hi
I've just watched https://vimeo.com/485383651 - great work.
There is the link to https://github.com/sberbank-ai-lab/automlbenchmark, I've looked at couple of scripts and different branches but I can't find results like charts from presentation.
Is it necessarry to run benchmarks by yourselft or I didn't noticed it. It'd be great to have link to results/comparison directly in documentation or main git page.

reproducibility

Hello everyone.

While using the library, we encountered a couple of cases in which we get different predicates on the same data with the same model configuration.

The first case
We train the model with these settings:

roles = {'target': 'label', 'text': ['text']}
task = Task('binary', metric='auc')

automl = TabularNLPAutoML(task=task,
    timeout=100000,
    general_params={'use_algos': ['nn', 'cb', 'lgb', 'linear_l2']},
    gpu_ids='0',
    reader_params={'n_jobs': 12},
    cpu_limit=13,
    text_params={'lang': 'ru'},

    nn_params={
        'lang': 'ru',
        'snap_params': {'k': 1, 'early_stopping': True, 'patience': 1, 'swa': False},
        'max_length': 256,
        'bs': 16,
        'bert_name': 'DeepPavlov/rubert-base-cased-conversational',
        'pooling': 'cls' },

    nn_pipeline_params={'text_features': 'bert'},
    autonlp_params={'model_name': 'random_lstm_bert'},
    gbm_pipeline_params={'text_features': 'embed'}, # tfidf embed
    linear_pipeline_params={'text_features': 'embed'},
    verbose=2
)

We predict on the test data and get the result # 1:

def to_labels(pos_probs, threshold):
    return (pos_probs >= threshold). astype ('int`)'

test_pred = automl.predict(test_pd)
labels = to_labels(test_pred.data[:, 0], 0.5)
print(classification_report(test_pd[roles['target']].values, labels, digits=4))

We repeat the training again, we get result # 2 on the test data, while the result is # 1 != result # 2

Please tell me what this behavior can be related to?

LabelEncoder filtering is not working

https://github.com/sberbank-ai-lab/LightAutoML/blob/master/lightautoml/transformers/categorical.py#L167

I suppose this line contains an error, because this line tries to filter an entire dataframe using integer value.

I have copy-pasted this code fragment and added few prints:

As we are able to see, there is no filtering.

If we fix it by adding a column to the dataframe, it will work well.

'bert' for feature in the gbm pipeline

Hello

Thank for your work.

Can we use 'bert' for feature in the gbm pipeline?

gbm_pipeline_params={'text_features': 'bert'}

DateSeasons transformer works wrong

https://github.com/sberbank-ai-lab/LightAutoML/blob/master/lightautoml/transformers/datetime.py#L272

The problem is in this line.
Method holidays.CountryHoliday returns a dict of datetime.date : str, but pandas dataframe may contain datetime.datetime types, thus it gives wrong result. It tries to compare datetime and date, of course it gives "False".

To fix it we have to add .dt.date attributes to pandas Series and it will work fine.

Error import in Tutorial notebooks

Probably one or more models failed

Hi,
I am trying out LAMA, and keep bumping into this message:

Model Lvl_0_Pipe_2_Mod_0_TorchNN failed during ml_algo.fit_predict call.

only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
Traceback (most recent call last):
  File "/home/dev/LAMAtest/AutoML/LAMA_train_test.py", line 38, in <module>
    oof_pred = automl.fit_predict(train, roles=roles)
  File "/home/dev/LAMAtest/venv/lib/python3.8/site-packages/log_calls/log_calls.py", line 1939, in _deco_base_f_wrapper_
    ret = f(*args, **kwargs)
  File "/home/dev/LAMAtest/venv/lib/python3.8/site-packages/lightautoml/automl/presets/tabular_presets.py", line 409, in fit_predict
    oof_pred = super().fit_predict(train, roles=roles, cv_iter=cv_iter, valid_data=valid_data)
  File "/home/dev/LAMAtest/venv/lib/python3.8/site-packages/log_calls/log_calls.py", line 1939, in _deco_base_f_wrapper_
    ret = f(*args, **kwargs)
  File "/home/dev/LAMAtest/venv/lib/python3.8/site-packages/lightautoml/automl/presets/base.py", line 176, in fit_predict
    result = super().fit_predict(train_data, roles, train_features, cv_iter, valid_data, valid_features)
  File "/home/dev/LAMAtest/venv/lib/python3.8/site-packages/log_calls/log_calls.py", line 1939, in _deco_base_f_wrapper_
    ret = f(*args, **kwargs)
  File "/home/dev/LAMAtest/venv/lib/python3.8/site-packages/lightautoml/automl/base.py", line 186, in fit_predict
    pipe_pred = ml_pipe.fit_predict(train_valid)
  File "/home/dev/LAMAtest/venv/lib/python3.8/site-packages/log_calls/log_calls.py", line 1939, in _deco_base_f_wrapper_
    ret = f(*args, **kwargs)
  File "/home/dev/LAMAtest/venv/lib/python3.8/site-packages/lightautoml/pipelines/ml/base.py", line 131, in fit_predict
    assert len(predictions) > 0, 'Pipeline finished with 0 models for some reason.\nProbably one or more models failed'
AssertionError: Pipeline finished with 0 models for some reason.
Probably one or more models failed

I do not understand what I am supposed to do and where to dig.
Did it fail to produce a model? Why? Is it because the data is bad? Or I did something wrong?
Please advise.

save model

how to save and load model config or weight?

Cannot use mae as loss in regression task

I'm using LightAutoML 0.2.16. When I define a new task with
task = Task('reg', loss='mae')
I get
sklearn doesn't support in general case mae and will not be used.
Btw set metric='mae' works fine for me

providing CustomIterator to cv_iter in tabular_automl.fit_predict fails

I would like to pass a custom cross validation to cv_iter in tabular_automl.fit_predict or automl.fit_predict. I run into this error:
AssertionError: Pipeline finished with 0 models for some reason.

to give you some context, I have my own custom cv class which has a split method (just like any sklearn cv) that yields train_indices , valid_indices for each split and I pass it to tabular_automl.fit_predict() in the following way:

from lightautoml.validation.base import CustomIterator
cv_splitter = cv.split(tr_data)
custom_iterator = CustomIterator(tr_data, cv_splitter)
tabular_automl.fit_predict(tr_data, roles={'target' : TARGET_NAME}, cv_iter = custom_cv_iterator)

I run into the same error if I use:

tabular_automl.fit_predict(tr_data, roles={'target' : TARGET_NAME}, cv_iter = cv_splitter)

Do you know what may have caused this error and how to fix it? Thanks!

[Question] How to determine to which class each probability column belongs?

Is it possible to determine to which class each probability columns belongs?

Consider the following example:

import numpy as np

x = np.random.random((150, 4))
y = np.asarray(list("abc") * 50).reshape(-1, 1)

import pandas as pd

data = pd.DataFrame(np.hstack([x, y]), columns=["f1", "f2", "f3", "f4", "target"])
print(data.head())

from lightautoml.tasks import Task
from lightautoml.automl.presets.tabular_presets import TabularUtilizedAutoML

task = Task("multiclass")
automl = TabularUtilizedAutoML(task=task, timeout=30)

automl.fit_predict(data, roles=dict(target="target"))

preds = automl.predict(data[["f1", "f2", "f3", "f4"]])

The resulting preds is a NumpyDataset with data shape (150,3) representing features WeightedBlend_{0,1,2}. Nowhere in the meta-data can I find whether the first column probabilities correspond to class 'a' (or any other class). Am I missing something here?

As far as I can tell the column order depends on the class order in the original training data. But I can't find this explicitly anywhere, nor can I find a progamatic way of retrieving the order of labels as used by lightautoml. I would expect e.g. a classes_ property or the feature names to reflect the classes of which the probability is predicted in each column.

Save the trained models

Hi,

I have a quick question. How do we save the trained lightautoml model? After reading the documentation, it seems to me that there is no way for us to save the model without using things like pickle.

Is there a description of the automl method?

I was wondering if there is good documentation about the AutoML process behind LAMA.
I watched the general overview video, but it's still not clear how exactly optimization works internally.
E.g. what optimization methods are used? Are only linear models and LightGBM considered? What kind of feature engineering is used, and is it static (same strategy for each dataset) or dynamic (LAMA explores various approaches to see what works on the data).

If possible, a written source is best (ideally a paper, but a documentation page works too).

Introduction of CI

Would you like to consider introducing CI? It realizes automation such as testing.

I recommend the GitHub Actions.

Feature names that inherited from string cause an exception

I got a dataframe by pandas.read_sql_table. Every column name has a type sqlalchemy.sql.elements.quoted_name. Many checks in the source code rely on str, so fit_predict fails.

Should the library notify developers about unsupported column name types, convert them to strings or allow column names to have any type?

NSections Issue with Train Dataset

When I inserted my training df into .fit_predict(), I receive the initiation:

- time: 72000 seconds
- cpus: 4 cores
- memory: 16 gb

Train data shape: (17290, 17455)

but then the process fails with error:

C:\ProgramData\Anaconda3\lib\site-packages\numpy\lib\shape_base.py in array_split(ary, indices_or_sections, axis)
    771         # handle array case.
--> 772         Nsections = len(indices_or_sections) + 1
    773         div_points = [0] + list(indices_or_sections) + [Ntotal]

TypeError: object of type 'int' has no len()

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
<ipython-input-134-bcf87a9972bc> in <module>
      5                        #general_params={'use_algos': [['lgb', 'cb', 'LinearLBFGS', 'linear_l1', 'xgb']]}
      6                       )
----> 7 oof_pred = automl.fit_predict(newTrainDum, roles=roles)

C:\ProgramData\Anaconda3\lib\site-packages\lightautoml\automl\presets\tabular_presets.py in fit_predict(self, train_data, roles, train_features, cv_iter, valid_data, valid_features)
    411             data, _ = read_data(valid_data, valid_features, self.cpu_limit, self.read_csv_params)
    412 
--> 413         oof_pred = super().fit_predict(train, roles=roles, cv_iter=cv_iter, valid_data=valid_data)
    414 
    415         return cast(NumpyDataset, oof_pred)

C:\ProgramData\Anaconda3\lib\site-packages\lightautoml\automl\presets\base.py in fit_predict(self, train_data, roles, train_features, cv_iter, valid_data, valid_features)
    171         logger.info('- memory: {} gb\n'.format(self.memory_limit))
    172         self.timer.start()
--> 173         result = super().fit_predict(train_data, roles, train_features, cv_iter, valid_data, valid_features)
    174         logger.info('\nAutoml preset training completed in {:.2f} seconds.'.format(self.timer.time_spent))
    175 

C:\ProgramData\Anaconda3\lib\site-packages\lightautoml\automl\base.py in fit_predict(self, train_data, roles, train_features, cv_iter, valid_data, valid_features)
    155         """
    156         self.timer.start()
--> 157         train_dataset = self.reader.fit_read(train_data, train_features, roles)
    158 
    159         assert len(self._levels) <= 1 or train_dataset.folds is not None, \

C:\ProgramData\Anaconda3\lib\site-packages\lightautoml\reader\base.py in fit_read(self, train_data, features_names, roles, **kwargs)
    323         dataset = PandasDataset(train_data[self.used_features], self.roles, task=self.task, **kwargs)
    324         if self.advanced_roles:
--> 325             new_roles = self.advanced_roles_guess(dataset, manual_roles=parsed_roles)
    326             droplist = [x for x in new_roles if new_roles[x].name == 'Drop' and not self._roles[x].force_input]
    327             self.upd_used_features(remove=droplist)

C:\ProgramData\Anaconda3\lib\site-packages\lightautoml\reader\base.py in advanced_roles_guess(self, dataset, manual_roles)
    492         # guess roles nor numerics
    493 
--> 494         stat = get_numeric_roles_stat(dataset, manual_roles=manual_roles,
    495                                       random_state=self.random_state,
    496                                       subsample=self.samples,

C:\ProgramData\Anaconda3\lib\site-packages\lightautoml\reader\guess_roles.py in get_numeric_roles_stat(train, subsample, random_state, manual_roles, n_jobs)
    263 
    264     # check scores as is
--> 265     res['raw_scores'] = get_score_from_pipe(train, target, empty_slice=empty_slice, n_jobs=n_jobs)
    266 
    267     # check unique values

C:\ProgramData\Anaconda3\lib\site-packages\lightautoml\reader\guess_roles.py in get_score_from_pipe(train, target, pipe, empty_slice, n_jobs)
    192         return _get_score_from_pipe(train, target, pipe, empty_slice)
    193 
--> 194     idx = np.array_split(np.arange(shape[1]), n_jobs)
    195     idx = [x for x in idx if len(x) > 0]
    196     n_jobs = len(idx)

<__array_function__ internals> in array_split(*args, **kwargs)

C:\ProgramData\Anaconda3\lib\site-packages\numpy\lib\shape_base.py in array_split(ary, indices_or_sections, axis)
    776         Nsections = int(indices_or_sections)
    777         if Nsections <= 0:
--> 778             raise ValueError('number sections must be larger than 0.')
    779         Neach_section, extras = divmod(Ntotal, Nsections)
    780         section_sizes = ([0] +

ValueError: number sections must be larger than 0.

Any help to adjust my df would be greatly appreciated. I did try to convert to array, but then the target param can't be found (which makes sense).

Thanks in advance!

RMSLE metric issue

I came across a side effect of RMSE metric - it could not be calculated for negative values. And as a result a whole training pipeline fails.

The issues can be reproduced on the "Used Card Price" dataset and the script below. I'm pretty sure that target values do not have any negative values, therefore the error arises because of negative predictions of linear models.
It would be nice to catch and safely eliminate this problem.

import pandas as pd
from lightautoml.automl.presets.tabular_presets import TabularAutoML
from lightautoml.tasks import Task

def main():
    df = pd.read_csv('../data/car-price-train.csv')

    roles = {'target': 'Price',
             'drop': ['Year']
             }

    task = Task('reg', metric="rmsle")

    automl = TabularAutoML(task=task, gpu_ids='', timeout=10000000000000)

    automl.fit_predict(df, roles=roles, verbose=5)


if __name__ == '__main__':
    main()

Stack trace is

[11:54:51] Stdout logging level is DEBUG.
[11:54:51] Copying TaskTimer may affect the parent PipelineTimer, so copy will create new unlimited TaskTimer
[11:54:51] Task: reg

[11:54:51] Start automl preset with listed constraints:
[11:54:51] - time: 10000000000000.00 seconds
[11:54:51] - CPU: 4 cores
[11:54:51] - memory: 16 GB

[11:54:51] Train data shape: (6019, 14)

[11:54:57] Feats was rejected during automatic roles guess: []
[11:54:57] Layer 1 train process start. Time left 9999999999993.55 secs
[11:54:58] Start fitting Lvl_0_Pipe_0_Mod_0_LinearL2 ...
[11:54:58] Training params: {'tol': 1e-06, 'max_iter': 100, 'cs': [1e-05, 5e-05, 0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50, 100, 500, 1000, 5000, 10000, 50000, 100000], 'early_stopping': 2, 'categorical_idx': [20, 21, 22], 'embed_sizes': array([11, 12,  3], dtype=int32), 'data_size': 23}
[11:54:58] ===== Start working with fold 0 for Lvl_0_Pipe_0_Mod_0_LinearL2 =====
[11:54:58] Linear model: C = 1e-05 score = -0.5098838547322487
[11:54:58] Linear model: C = 5e-05 score = -0.41608116270993006
[11:54:58] Model Lvl_0_Pipe_0_Mod_0_LinearL2 failed during ml_algo.fit_predict call.

Mean Squared Logarithmic Error cannot be used when targets contain negative values.
Traceback (most recent call last):
  File "/Users/user/projects/LAML_dev/work/RMSLE_issue.py", line 21, in <module>
    main()
  File "/Users/user/projects/LAML_dev/work/RMSLE_issue.py", line 17, in main
    automl.fit_predict(df, roles=roles, verbose=5)
  File "/Users/user/projects/LAML_dev/LightAutoML/lightautoml/automl/presets/tabular_presets.py", line 525, in fit_predict
    train, roles=roles, cv_iter=cv_iter, valid_data=valid_data, verbose=verbose
  File "/Users/user/projects/LAML_dev/LightAutoML/lightautoml/automl/presets/base.py", line 211, in fit_predict
    verbose,
  File "/Users/user/projects/LAML_dev/LightAutoML/lightautoml/automl/base.py", line 225, in fit_predict
    pipe_pred = ml_pipe.fit_predict(train_valid)
  File "/Users/user/projects/LAML_dev/LightAutoML/lightautoml/pipelines/ml/base.py", line 150, in fit_predict
    ), "Pipeline finished with 0 models for some reason.\nProbably one or more models failed"
AssertionError: Pipeline finished with 0 models for some reason.
Probably one or more models failed

Process finished with exit code 1

"AttributeError: type object 'Callable' has no attribute '_abc_registry'" during installation

Hello! During the installation of lightautoml (I did it in Docker), I had the following error:

#10 [6/9] RUN pip install lightautoml
#10 1.140 Collecting lightautoml
#10 1.392   Downloading LightAutoML-0.2.5-py3-none-any.whl (230 kB)
#10 1.920 Collecting json2html
#10 1.966   Downloading json2html-1.3.0.tar.gz (7.0 kB)
#10 2.259 Requirement already satisfied: nltk in /home/user/conda/lib/python3.7/site-packages (from lightautoml) (3.5)
#10 2.278 Requirement already satisfied: IPython in /home/user/conda/lib/python3.7/site-packages (from lightautoml) (7.19.0)
#10 2.316 Requirement already satisfied: torch in /home/user/conda/lib/python3.7/site-packages (from lightautoml) (1.3.0+cu100)
#10 2.407 Collecting pandoc
#10 2.461   Downloading pandoc-1.0.2.tar.gz (488 kB)
#10 3.466 Collecting sphinx-autodoc-typehints
#10 3.510   Downloading sphinx_autodoc_typehints-1.11.1-py3-none-any.whl (8.7 kB)
#10 3.560 Requirement already satisfied: scikit-learn in /home/user/conda/lib/python3.7/site-packages (from lightautoml) (0.23.2)
#10 3.662 Collecting optuna
#10 3.709   Downloading optuna-2.3.0.tar.gz (258 kB)
#10 3.975   Installing build dependencies: started
#10 4.383   Installing build dependencies: finished with status 'error'
#10 4.383   ERROR: Command errored out with exit status 1:
#10 4.383    command: /home/user/conda/bin/python3.7 /home/user/conda/lib/python3.7/site-packages/pip install --ignore-installed --no-user --prefix /tmp/pip-build-env-4m56urr5/overlay --no-warn-script-location --no-binary :none: --only-binary :none: -i https://pypi.org/simple -- 'setuptools>=40.8.0' wheel
#10 4.383        cwd: None
#10 4.383   Complete output (44 lines):
#10 4.383   Traceback (most recent call last):
#10 4.383     File "/home/user/conda/lib/python3.7/runpy.py", line 193, in _run_module_as_main
#10 4.383       "__main__", mod_spec)
#10 4.383     File "/home/user/conda/lib/python3.7/runpy.py", line 85, in _run_code
#10 4.383       exec(code, run_globals)
#10 4.383     File "/home/user/conda/lib/python3.7/site-packages/pip/__main__.py", line 26, in <module>
#10 4.383       sys.exit(_main())
#10 4.383     File "/home/user/conda/lib/python3.7/site-packages/pip/_internal/cli/main.py", line 73, in main
#10 4.383       command = create_command(cmd_name, isolated=("--isolated" in cmd_args))
#10 4.383     File "/home/user/conda/lib/python3.7/site-packages/pip/_internal/commands/__init__.py", line 104, in create_command
#10 4.383       module = importlib.import_module(module_path)
#10 4.383     File "/home/user/conda/lib/python3.7/importlib/__init__.py", line 127, in import_module
#10 4.383       return _bootstrap._gcd_import(name[level:], package, level)
#10 4.383     File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
#10 4.383     File "<frozen importlib._bootstrap>", line 983, in _find_and_load
#10 4.383     File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
#10 4.383     File "<frozen importlib._bootstrap>", line 677, in _load_unlocked
#10 4.383     File "<frozen importlib._bootstrap_external>", line 728, in exec_module
#10 4.383     File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
#10 4.383     File "/home/user/conda/lib/python3.7/site-packages/pip/_internal/commands/install.py", line 24, in <module>
#10 4.383       from pip._internal.cli.req_command import RequirementCommand, with_cleanup
#10 4.383     File "/home/user/conda/lib/python3.7/site-packages/pip/_internal/cli/req_command.py", line 16, in <module>
#10 4.383       from pip._internal.index.package_finder import PackageFinder
#10 4.383     File "/home/user/conda/lib/python3.7/site-packages/pip/_internal/index/package_finder.py", line 21, in <module>
#10 4.383       from pip._internal.index.collector import parse_links
#10 4.383     File "/home/user/conda/lib/python3.7/site-packages/pip/_internal/index/collector.py", line 14, in <module>
#10 4.383       from pip._vendor import html5lib, requests
#10 4.383     File "/home/user/conda/lib/python3.7/site-packages/pip/_vendor/requests/__init__.py", line 114, in <module>
#10 4.383       from . import utils
#10 4.383     File "/home/user/conda/lib/python3.7/site-packages/pip/_vendor/requests/utils.py", line 25, in <module>
#10 4.383       from . import certs
#10 4.383     File "/home/user/conda/lib/python3.7/site-packages/pip/_vendor/requests/certs.py", line 15, in <module>
#10 4.383       from pip._vendor.certifi import where
#10 4.383     File "/home/user/conda/lib/python3.7/site-packages/pip/_vendor/certifi/__init__.py", line 1, in <module>
#10 4.383       from .core import contents, where
#10 4.383     File "/home/user/conda/lib/python3.7/site-packages/pip/_vendor/certifi/core.py", line 12, in <module>
#10 4.383       from importlib.resources import read_text
#10 4.383     File "/home/user/conda/lib/python3.7/importlib/resources.py", line 11, in <module>
#10 4.383       from typing import Iterable, Iterator, Optional, Set, Union   # noqa: F401
#10 4.383     File "/home/user/conda/lib/python3.7/site-packages/typing.py", line 1359, in <module>
#10 4.383       class Callable(extra=collections_abc.Callable, metaclass=CallableMeta):
#10 4.383     File "/home/user/conda/lib/python3.7/site-packages/typing.py", line 1007, in __new__
#10 4.383       self._abc_registry = extra._abc_registry
#10 4.383   AttributeError: type object 'Callable' has no attribute '_abc_registry'

I suppose it's an issue with typing package as described here: https://stackoverflow.com/questions/55833509/attributeerror-type-object-callable-has-no-attribute-abc-registry. If anybody has the same issue, notice that I solved this problem by pip uninstall typing, and then pip install lightautoml worked just fine.

Problem with create_model_str_desc

I tried to get model parameters using create_model_str_desc() but got this error: AttributeError: 'TabularAutoML' object has no attribute 'create_model_str_desc'
I use this notebook as example: https://www.kaggle.com/alexryzhkov/aug21-lightautoml-starter

ReportDeco does not work on all the AutoML objects

I can not wrap AutoML from the tutorial into ReportDeco wrapper in Tutorial 3.

After

automl = AutoML(reader=reader, levels=[
    [gbm_lvl0, reg_lvl0]
], timer=timer, blender=blender, skip_conn=False)

adding

from lightautoml.report import ReportDeco
RD = ReportDeco(output_path='./binary_report/')
automl = RD(automl)

gives error

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-22-7df350cbcb2a> in <module>()
      1 from lightautoml.report import ReportDeco
      2 RD = ReportDeco(output_path='./binary_report/')
----> 3 automl = RD(automl)

/usr/local/lib/python3.7/dist-packages/lightautoml/report/report_deco.py in __call__(self, model)
    375 
    376         # AutoML only
--> 377         self.task = self._model.task._name  # valid_task_names = ['binary', 'reg', 'multiclass']
    378 
    379         # add informataion to report

AttributeError: 'AutoML' object has no attribute 'task'

Looks like ReportDeco works only with tabular preset.

Reports in multiclass models are not working

lightautoml version: 0.2.14
Expected behaviour: model is trained and report is created.
Actual behaviour:

automl preset training completed in 18.18 seconds.
Traceback (most recent call last):
  File "main.py", line 70, in <module>
    main()
  File "main.py", line 64, in main
    oof_pred = automl.fit_predict(train_data, roles=roles)
  File "/Users/aleksandr/Projects/report-test/.venv/lib/python3.8/site-packages/lightautoml/report/report_deco.py", line 524, in fit_predict
    summary = self._multiclass_details(data)
  File "/Users/aleksandr/Projects/report-test/.venv/lib/python3.8/site-packages/lightautoml/report/report_deco.py", line 433, in _multiclass_details
    classes = sorted(self.mapping, key=self.mapping.get)
AttributeError: 'NoneType' object has no attribute 'get'

Code:

import pandas as pd
from lightautoml.automl.presets.tabular_presets import TabularAutoML
from lightautoml.report.report_deco import ReportDeco
from lightautoml.dataset.roles import *
from lightautoml.tasks import Task


def main():
    data = pd.read_csv("./train_small.csv", decimal=".")
    train_data = data

    roles = {'target': "Survived"}

    task = Task(
        "multiclass",
        loss=None,
        loss_params=None,
        metric="accuracy",
        metric_params=None,
        greater_is_better=True
    )

    automl = TabularAutoML(task=task, timeout=1200)
    automl = ReportDeco(
        output_path="./rep2",
        report_file_name="report"
    )(automl)

    oof_pred = automl.fit_predict(train_data, roles=roles)

if __name__ == "__main__":
    main()

Dataset:

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35,0,0,373450,8.05,,S
6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
7,0,1,"McCarthy, Mr. Timothy J",male,54,0,0,17463,51.8625,E46,S
8,0,3,"Palsson, Master. Gosta Leonard",male,2,3,1,349909,21.075,,S
9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27,0,2,347742,11.1333,,S
10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14,1,0,237736,30.0708,,C
11,1,3,"Sandstrom, Miss. Marguerite Rut",female,4,1,1,PP 9549,16.7,G6,S
12,1,1,"Bonnell, Miss. Elizabeth",female,58,0,0,113783,26.55,C103,S
13,0,3,"Saundercock, Mr. William Henry",male,20,0,0,A/5. 2151,8.05,,S
14,0,3,"Andersson, Mr. Anders Johan",male,39,1,5,347082,31.275,,S
15,0,3,"Vestrom, Miss. Hulda Amanda Adolfina",female,14,0,0,350406,7.8542,,S
16,1,2,"Hewlett, Mrs. (Mary D Kingcome) ",female,55,0,0,248706,16,,S
17,0,3,"Rice, Master. Eugene",male,2,4,1,382652,29.125,,Q
18,1,2,"Williams, Mr. Charles Eugene",male,,0,0,244373,13,,S
19,0,3,"Vander Planke, Mrs. Julius (Emelia Maria Vandemoortele)",female,31,1,0,345763,18,,S
20,1,3,"Masselmani, Mrs. Fatima",female,,0,0,2649,7.225,,C
21,0,2,"Fynney, Mr. Joseph J",male,35,0,0,239865,26,,S
22,1,2,"Beesley, Mr. Lawrence",male,34,0,0,248698,13,D56,S
23,1,3,"McGowan, Miss. Anna ""Annie""",female,15,0,0,330923,8.0292,,Q
24,1,1,"Sloper, Mr. William Thompson",male,28,0,0,113788,35.5,A6,S
25,0,3,"Palsson, Miss. Torborg Danira",female,8,3,1,349909,21.075,,S
26,1,3,"Asplund, Mrs. Carl Oscar (Selma Augusta Emilia Johansson)",female,38,1,5,347077,31.3875,,S
27,0,3,"Emir, Mr. Farred Chehab",male,,0,0,2631,7.225,,C
28,0,1,"Fortune, Mr. Charles Alexander",male,19,3,2,19950,263,C23 C25 C27,S
29,1,3,"O'Dwyer, Miss. Ellen ""Nellie""",female,,0,0,330959,7.8792,,Q
30,0,3,"Todoroff, Mr. Lalio",male,,0,0,349216,7.8958,,S
31,0,1,"Uruchurtu, Don. Manuel E",male,40,0,0,PC 17601,27.7208,,C
32,1,1,"Spencer, Mrs. William Augustus (Marie Eugenie)",female,,1,0,PC 17569,146.5208,B78,C
33,1,3,"Glynn, Miss. Mary Agatha",female,,0,0,335677,7.75,,Q
34,0,2,"Wheadon, Mr. Edward H",male,66,0,0,C.A. 24579,10.5,,S
35,0,1,"Meyer, Mr. Edgar Joseph",male,28,1,0,PC 17604,82.1708,,C
36,0,1,"Holverson, Mr. Alexander Oskar",male,42,1,0,113789,52,,S
37,1,3,"Mamee, Mr. Hanna",male,,0,0,2677,7.2292,,C
38,0,3,"Cann, Mr. Ernest Charles",male,21,0,0,A./5. 2152,8.05,,S
39,0,3,"Vander Planke, Miss. Augusta Maria",female,18,2,0,345764,18,,S
40,1,3,"Nicola-Yarred, Miss. Jamila",female,14,1,0,2651,11.2417,,C

Space error in a Virtual Environment/Machine?

Hi Team,
Question on installing this lightAutoML:

I am trying to use this installation in a specific environment or machine where the space is low. Can we customize to the installations related to only multiclass problem only (non-NLP data)? I really do not want any installations of NLP/Image processing, may be these are taking lot of space. How can I customize my installations according to problem? In this case I want to use only lightgbm/catboost hyper tuned stuff only.

Regards
Shravan

can I get the automl.fit_predict output as a html report?

First of all, Thanks for open-sourcing this library.
Currently when we run the automl object's fit_predict on our dataset; it outputs all the detailed information about whatever model was trained and how did that work out. For models with huge number of parameters; or pipeline with too many models; this report is important and needs to be available as a parsable report. Is there a feature already where I can get this report or store it somewhere in a file ? if not this would make a good feature.

TabularAutoML object has no attribute 'reader'

I am currently having an issue where I have replicated another project. After finishing the model, when trying to predict, I get:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
C:\Users\ALEXAN~1\AppData\Local\Temp/ipykernel_21260/2512481675.py in <module>
----> 1 testPreds = automl_rd.predict(test)

~\.conda\envs\threeNine2\lib\site-packages\lightautoml\report\report_deco.py in predict(self, *args, **kwargs)
    729         self._n_test_sample += 1
    730         # get predictions
--> 731         test_preds = self._model.predict(*args, **kwargs)
    732 
    733         test_data = kwargs["test"] if "test" in kwargs else args[0]

~\.conda\envs\threeNine2\lib\site-packages\lightautoml\automl\presets\tabular_presets.py in predict(self, data, features_names, batch_size, n_jobs, return_all_predictions)
    543         if batch_size is None and n_jobs == 1:
    544             data, _ = read_data(data, features_names, self.cpu_limit, read_csv_params)
--> 545             pred = super().predict(data, features_names, return_all_predictions)
    546             return cast(NumpyDataset, pred)
    547 

~\.conda\envs\threeNine2\lib\site-packages\lightautoml\automl\base.py in predict(self, data, features_names, return_all_predictions)
    290 
    291         """
--> 292         dataset = self.reader.read(data, features_names=features_names, add_array_attrs=False)
    293 
    294         for n, level in enumerate(self.levels, 1):

AttributeError: 'TabularAutoML' object has no attribute 'reader'

I'm not sure how this error occurs as the TabularModel was instantiated with reader?

Thank you in advance

Where is profiler?

In the tutorial on analytics india, Profiler was used.

It was imported with from lightautoml.utils.profiler import Profiler, but there's no reference in the official documentation and importing it raises ModuleNotFoundError: No module named 'lightautoml.utils.profiler'.

Is Profiler gone or moved to another module or rebranded or ...?

Error while import in docker image ubi8 (redhat)

I've got the issue while using lightautoml inside docker image build on base image ubi8.
The command
import lightautoml
lead to error (see a screenshot)

Exploding of linear models for non-smooth loss function

It turned out that the package fails when I use "Used Card Price" dataset and RMSLE loss function (see the error stack below).

After long investigation I can conclude that:

The origin of problem is "Year" feature. It works fine if this feature is dropped.
More deeply, coefficients of categorical features grows infinitely and becomes NaN for some reason if regularization term is small.

I suppose that the problem is unexpected behavior of LBFGS solver with non-smooth loss function and propose several solutions:

replace LBFGS on first-order solver (haven't tested);
decrease learning rate (works fine);
do not use Wolfe condition (works fine);
catch and process unexpected behavior to finish it correctly.

So far, I open PR to catch the pipeline failure.

[12:39:46] Stdout logging level is DEBUG.
[12:39:46] Copying TaskTimer may affect the parent PipelineTimer, so copy will create new unlimited TaskTimer
[12:39:46] Task: reg

[12:39:46] Start automl preset with listed constraints:
[12:39:46] - time: 10000000000000.00 seconds
[12:39:46] - CPU: 4 cores
[12:39:46] - memory: 16 GB

[12:39:46] Train data shape: (6019, 14)

[12:39:50] Feats was rejected during automatic roles guess: []
[12:39:50] Layer 1 train process start. Time left 9999999999996.40 secs
[12:39:50] Start fitting Lvl_0_Pipe_0_Mod_0_LinearL2 ...
[12:39:50] Training params: {'tol': 1e-06, 'max_iter': 100, 'cs': [1e-05, 5e-05, 0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50, 100, 500, 1000, 5000, 10000, 50000, 100000], 'early_stopping': 2, 'categorical_idx': [21, 22, 23], 'embed_sizes': array([11, 12,  3], dtype=int32), 'data_size': 24}
[12:39:50] ===== Start working with fold 0 for Lvl_0_Pipe_0_Mod_0_LinearL2 =====
[12:39:50] Linear model: C = 1e-05 score = -128.8645392871719
[12:39:50] Linear model: C = 5e-05 score = -121.64156159218489
[12:39:50] Linear model: C = 0.0001 score = -114.93390055055276
[12:39:50] Linear model: C = 0.0005 score = -89.57974689706185
[12:39:50] Linear model: C = 0.001 score = -77.2702312814782
Traceback (most recent call last):
  File "/Users/user/projects/LAML_dev/work/RMSLE_issue.py", line 21, in <module>
    main()
  File "/Users/user/projects/LAML_dev/work/RMSLE_issue.py", line 18, in main
    automl.fit_predict(df, roles=roles, verbose=5)
  File "/Users/user/projects/LAML_dev/LightAutoML/lightautoml/automl/presets/tabular_presets.py", line 525, in fit_predict
    train, roles=roles, cv_iter=cv_iter, valid_data=valid_data, verbose=verbose
  File "/Users/user/projects/LAML_dev/LightAutoML/lightautoml/automl/presets/base.py", line 211, in fit_predict
    verbose,
  File "/Users/user/projects/LAML_dev/LightAutoML/lightautoml/automl/base.py", line 225, in fit_predict
    pipe_pred = ml_pipe.fit_predict(train_valid)
  File "/Users/user/projects/LAML_dev/LightAutoML/lightautoml/pipelines/ml/base.py", line 150, in fit_predict
    ), "Pipeline finished with 0 models for some reason.\nProbably one or more models failed"
AssertionError: Pipeline finished with 0 models for some reason.
Probably one or more models failed
[12:39:50] Model Lvl_0_Pipe_0_Mod_0_LinearL2 failed during ml_algo.fit_predict call.

Input contains NaN, infinity or a value too large for dtype('float32').

Incorrect gensim.models.FastText usage

The library looks great, thank you for the work. But I had an issue running TabularNLPAutoML.
You use an arbitrary version of gensim setting gensim = "*" in the dependencies here. It leads to using any version of gensim including 4.* where parameter size of gensim.models.FastText class was renamed in favour of vector_size. This causes a failure running TabularNLPAutoML. Overall, consider fixed versions of dependencies for not having such issues all the time.

AssertionError: Pipeline finished with 0 models for some reason

I might have coded this incorrectly (very likley) but I'm getting the error:

AssertionError: Pipeline finished with 0 models for some reason``

Reproducible test case:

nlp.zip

lightautoml==0.2.14
huggingface-hub==0.0.8
transformers==4.6.1
python=3.9.5


LightAutoML is an amazing project, thank you

The kernel appears to have died

Hi!

When linear_l2 ends and lgb starts the kernel dies.
Tried updating libraries, nothing worked. Dataset size seems to be ok, about 30k observations and 10 columns
Using Jupyter on Mac OS.

Thanks!

Error on Import

With the latest version, after installing I always get an error on first import:

>>> import lightautoml

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/venv/lib/python3.9/site-packages/lightautoml/__init__.py", line 19, in <module>
    from .automl.presets import *
  File "/venv/lib/python3.9/site-packages/lightautoml/automl/presets/image_presets.py", line 17, in <module>
    from ...pipelines.features.image_pipeline import ImageSimpleFeatures, ImageAutoFeatures
  File "/venv/lib/python3.9/site-packages/lightautoml/pipelines/features/image_pipeline.py", line 12, in <module>
    from ...transformers.image import ImageFeaturesTransformer, AutoCVWrap
  File "/venv/lib/python3.9/site-packages/lightautoml/transformers/image.py", line 15, in <module>
    from ..image.image import CreateImageFeatures, DeepImageEmbedder
  File "/venv/lib/python3.9/site-packages/lightautoml/image/image.py", line 6, in <module>
    import cv2
  File "/venv/lib/python3.9/site-packages/cv2/__init__.py", line 5, in <module>
    from .cv2 import *
ImportError: libGL.so.1: cannot open shared object file: No such file or directory

It seems to be an issue with opencv-python and requires an install of the python3-opencv package (apt-get install -y python3-opencv). Making the CV part of LAMA an optional install, or at least adding this to the documentation would be helpful.

TabularUtilizedAutoML for Multiclass

Apologies for what might be a stupid question -

    automl = TabularUtilizedAutoML(task=Task('multiclass')...)

    oof_pred = automl.fit_predict(
        df_train, roles={"target": TARGET_NAME}
    )

    test_pred = automl.predict(df_test)

The test_pred is in the predict_proba format, but what i'm looking for is for test_pred to have a method that will give me the label of the prediction, for Iris this would be one of Setosa/Virginica/Versicolor, the automl is automatically encoding the data frame during training for the target, is there a way to get the prediction label please? I can't find an attribute in automl like _classes either.

Is there a way please to call automl.predict and get back the prediction label?

[Request] Release notes

As far as I can tell, there are no release notes for the different releases. I want to suggest starting with release notes for new releases. At least I myself was trying to figure out if a particular property changed between releases, and found myself with no way to figure that out (other than empirically trying both versions).

Ml

Demo is not working

Tried to launch Demo https://colab.research.google.com/github/sberbank-ai-lab/LightAutoML/blob/master/examples/tutorials/Tutorial_4_NLP_Interpretation.ipynb
Error traceback:

Видео обзор библиотеки?

На ai-journey, вроде был обзор, но сейчас не могу его найти, помогите пожалуйста найти видео
Оно было бы очень уместно и было бы здорово добавить его на главную станицу либы, как альтернативу ноутбукам

Poetry cant solve deps

poetry 1.1.11 cant solve deps for lightautoml==0.3.0 and python 3.9.5

poetry update
Creating virtualenv test1-xVJJlKZl-py3.9 in /home/user/.cache/pypoetry/virtualenvs
Updating dependencies
Resolving dependencies... (0.0s)

SolverProblemError

The current project's Python requirement (>=3.9,<4.0) is not compatible with some of the required packages Python requirement:
- lightautoml requires Python >=3.6.1,<3.10, so it will not be satisfied for Python >=3.10,<4.0

Because no versions of lightautoml match >0.3.0,<0.4.0
and lightautoml (0.3.0) requires Python >=3.6.1,<3.10, lightautoml is forbidden.
So, because test1 depends on lightautoml (^0.3.0), version solving failed.

at ~/.local/lib/python3.9/site-packages/poetry/puzzle/solver.py:241 in _solve
237│ packages = result.packages
238│ except OverrideNeeded as e:
239│ return self.solve_in_compatibility_mode(e.overrides, use_latest=use_latest)
240│ except SolveFailure as e:
→ 241│ raise SolverProblemError(e)
242│
243│ results = dict(
244│ depth_first_search(
245│ PackageNode(self._package, packages), aggregate_package_nodes

• Check your dependencies Python requirement: The Python requirement can be specified via the python or markers properties

For lightautoml, a possible solution would be to set the `python` property to ">=3.9,<3.10"

https://python-poetry.org/docs/dependency-specification/#python-restricted-dependencies,
https://python-poetry.org/docs/dependency-specification/#using-environment-markers

AttributeError: module 'lightautoml' has no attribute 'version'

After successful run of pip install lightautoml, I tried to see the version of the module:

import lightautoml
print(lightautoml.__version__)

Unfortunately, I faced the following:

AttributeError: module 'lightautoml' has no attribute '__version__'

Is it OK? How come I can't see package version...

OSError on MacOs

After successful installation, it throws this error when trying to import the library on Mac Os python 3.8. How can I fix this?

OSError Traceback (most recent call last)
in
20 from sklearn.linear_model import LinearRegression, LogisticRegression
21 from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
---> 22 import lightautoml
23 from lightautoml.automl.presets.tabular_presets import TabularAutoML
24 from lightautoml.tasks import Task

~/.local/lib/python3.8/site-packages/lightautoml/init.py in
14 version = importlib_metadata.version(name)
15
---> 16 from .addons import *
17 from .addons.utilization import *
18 from .automl import *

~/.local/lib/python3.8/site-packages/lightautoml/addons/utilization/init.py in
1 """Tools to configure resources utilization."""
----> 2 from .utilization import TimeUtilization
3
4 all = ['TimeUtilization']

~/.local/lib/python3.8/site-packages/lightautoml/addons/utilization/utilization.py in
6 from log_calls import record_history
7
----> 8 from ...automl.base import AutoML
9 from ...automl.blend import Blender, BestModelSelector
10 from ...automl.presets.base import AutoMLPreset

~/.local/lib/python3.8/site-packages/lightautoml/automl/base.py in
6 from log_calls import record_history
7
----> 8 from .blend import Blender, BestModelSelector
9 from ..dataset.base import LAMLDataset
10 from ..dataset.utils import concatenate

~/.local/lib/python3.8/site-packages/lightautoml/automl/blend.py in
7 from scipy.optimize import minimize_scalar
8
----> 9 from ..dataset.base import LAMLDataset
10 from ..dataset.np_pd_dataset import NumpyDataset
11 from ..dataset.roles import NumericRole

~/.local/lib/python3.8/site-packages/lightautoml/dataset/base.py in
7
8 from .roles import ColumnRole
----> 9 from ..tasks.base import Task
10
11 valid_array_attributes = ('target', 'group', 'folds', 'weights')

~/.local/lib/python3.8/site-packages/lightautoml/tasks/init.py in
1 """Define the task to solve its loss, metric."""
2
----> 3 from .base import Task
4
5 all = ['losses', 'base', 'common_metric', 'utils', 'Task']

~/.local/lib/python3.8/site-packages/lightautoml/tasks/base.py in
8 from log_calls import record_history
9
---> 10 from lightautoml.tasks.losses import LGBLoss, SKLoss, TORCHLoss, CBLoss
11 from .common_metric import _valid_str_metric_names, _valid_metric_args
12 from .utils import infer_gib, infer_gib_multiclass

~/.local/lib/python3.8/site-packages/lightautoml/tasks/losses/init.py in
3 from .base import _valid_str_metric_names
4 from .cb import CBLoss
----> 5 from .lgb import LGBLoss
6 from .sklearn import SKLoss
7 from .torch import TORCHLoss, TorchLossWrapper

~/.local/lib/python3.8/site-packages/lightautoml/tasks/losses/lgb.py in
4 from typing import Callable, Tuple, Union, Optional, Dict
5
----> 6 import lightgbm as lgb
7 import numpy as np
8 from log_calls import record_history

~/.local/lib/python3.8/site-packages/lightgbm/init.py in
6 from future import absolute_import
7
----> 8 from .basic import Booster, Dataset
9 from .callback import (early_stopping, print_evaluation, record_evaluation,
10 reset_parameter)

~/.local/lib/python3.8/site-packages/lightgbm/basic.py in
31
32
---> 33 _LIB = _load_lib()
34
35

~/.local/lib/python3.8/site-packages/lightgbm/basic.py in _load_lib()
26 if len(lib_path) == 0:
27 return None
---> 28 lib = ctypes.cdll.LoadLibrary(lib_path[0])
29 lib.LGBM_GetLastError.restype = ctypes.c_char_p
30 return lib

/Library/anaconda3/lib/python3.8/ctypes/init.py in LoadLibrary(self, name)
449
450 def LoadLibrary(self, name):
--> 451 return self._dlltype(name)
452
453 cdll = LibraryLoader(CDLL)

/Library/anaconda3/lib/python3.8/ctypes/init.py in init(self, name, mode, handle, use_errno, use_last_error, winmode)
371
372 if handle is None:
--> 373 self._handle = _dlopen(self._name, mode)
374 else:
375 self._handle = handle

OSError: dlopen(/Users/a185583357/.local/lib/python3.8/site-packages/lightgbm/lib_lightgbm.so, 6): Library not loaded: /usr/local/opt/libomp/lib/libomp.dylib
Referenced from: /Users/a185583357/.local/lib/python3.8/site-packages/lightgbm/lib_lightgbm.so
Reason: image not found

ERROR: torch has an invalid wheel, .dist-info directory not found

I am getting this error while trying to pip install in a windows machine. Pip version is 20.0.2

"ERROR: torch has an invalid wheel, .dist-info directory not found"

Data downloader error

python3.8.7
python demo11.py

Calling torch.data.dataloader error

r

Issue with "Tutorial_1. Create your own pipeline.ipynb"

Hi!

When running the first tutorial on Google.Colab from the colab button in README.md I've faced the following issue in the import of lightautoml:

ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 88 from C header, got 80 from PyObject

I've found a solution here in the first answer. But it seems that the notebook should be updated.

Be able to put TabularNLPAutoML into an sklearn pipeline

    steps = [
        ('automl', TabularNLPAutoML(...))
    ]
    pipeline = Pipeline(steps)

    pred = pipeline.fit_predict(
        df_train,
        roles={"target": TARGET_NAME, "text": TEXT_COLUMNS, "drop": DROP_FEATURES},
    )

produces:

TypeError: Last step of Pipeline should implement fit or be the string 'passthrough'.

I'm doing this because I have some text clean up Transformers that I'd like to pickle in one "model" object so the same clean up happens at inference time.

pip installs dev packages with lama

I install LightAutoML with pip and get pytest, sphinx and it helpers installed. They come with autowoe. Please, have a look.

Just funny: I am sure I had no chance to install korean-lunar-calendar, but now I have it, thank you, Lama ❤️

report deco error

Hi, thanks as always for this great library! I keep running into an issue when trying to use the report decorator:

---------------------------------------------------------------------------
UFuncTypeError Traceback (most recent call last)
in
----> 1 oof_pred2 = rml.fit_predict(bldf[ucols],
2 #log_file = f'{target}_log.log',
3 roles = roles,
4 verbose = 1,
5 )

~/.conda/envs/mamba/lib/python3.9/site-packages/lightautoml/report/report_deco.py in fit_predict(self, *args, **kwargs)
701 # generate train data section
702 self._train_data_overview = self._data_genenal_info(train_data)
--> 703 self._describe_roles(train_data)
704 self._describe_dropped_features(train_data)
705 self._generate_train_set_section()

~/.conda/envs/mamba/lib/python3.9/site-packages/lightautoml/report/report_deco.py in _describe_roles(self, train_data)
972 values = train_data[feature_name].dropna().values
973 item["min"] = np.min(values)
--> 974 item["quantile_25"] = np.quantile(values, 0.25)
975 item["average"] = np.mean(values)
976 item["median"] = np.median(values)

<__array_function__ internals> in quantile(*args, **kwargs)

~/.conda/envs/mamba/lib/python3.9/site-packages/numpy/lib/function_base.py in quantile(a, q, axis, out, overwrite_input, interpolation, keepdims)
3977 if not _quantile_is_valid(q):
3978 raise ValueError("Quantiles must be in the range [0, 1]")
-> 3979 return _quantile_unchecked(
3980 a, q, axis, out, overwrite_input, interpolation, keepdims)
3981

~/.conda/envs/mamba/lib/python3.9/site-packages/numpy/lib/function_base.py in _quantile_unchecked(a, q, axis, out, overwrite_input, interpolation, keepdims)
3984 interpolation='linear', keepdims=False):
3985 """Assumes that q is in [0, 1], and is an ndarray"""
-> 3986 r, k = _ureduce(a, func=_quantile_ureduce_func, q=q, axis=axis, out=out,
3987 overwrite_input=overwrite_input,
3988 interpolation=interpolation)

~/.conda/envs/mamba/lib/python3.9/site-packages/numpy/lib/function_base.py in _ureduce(a, func, **kwargs)
3562 keepdim = (1,) * a.ndim
3563
-> 3564 r = func(a, **kwargs)
3565 return r, keepdim
3566

~/.conda/envs/mamba/lib/python3.9/site-packages/numpy/lib/function_base.py in _quantile_ureduce_func(failed resolving arguments)
4110 x_above = take(ap, indices_above, axis=0)
4111
-> 4112 r = _lerp(x_below, x_above, weights_above, out=out)
4113
4114 # if any slice contained a nan, then all results on that slice are also nan

~/.conda/envs/mamba/lib/python3.9/site-packages/numpy/lib/function_base.py in _lerp(a, b, t, out)
4007 def _lerp(a, b, t, out=None):
4008 """ Linearly interpolate from a to b by a factor of t """
-> 4009 diff_b_a = subtract(b, a)
4010 # asanyarray is a stop-gap until gh-13105
4011 lerp_interpolation = asanyarray(add(a, diff_b_a*t, out=out))

UFuncTypeError: ufunc 'subtract' did not contain a loop with signature matching types (dtype(' None

At first I thought it was an issue when there are object-type columns, but now I'm not sure, is there a workaround for this?

Task never completes (multiclass)

This script below ran for many hours (MacbookPro (current Intel model), no GPU, 16GB RAM) before I killed it, some runs give me an error but it keeps going:
An attempt has been made to start a new process before the current process has finished its bootstrapping phase.

I set a timeout of an hour which seems to be ignored, I've tried playing with different algorithms but can't this to make a model, every time I give up after running it all night.

import numpy as np
import pandas as pd
from lightautoml.automl.presets.text_presets import TabularAutoML, TabularNLPAutoML
from lightautoml.tasks import Task
from sklearn import preprocessing

from sklearn.model_selection import train_test_split

df = pd.read_json("https://github.com/nomadotto/News_Classifier/blob/master/News_Category_Dataset_v2.json?raw=true", lines=True)

print(df.head())

automl = TabularNLPAutoML(
    task=Task("multiclass"),
    timeout=3600,
    verbose=2,
    general_params={"use_algos": ["lgb", "cb"]},
    reader_params={"cv": 5, "random_state": 42},
    text_params={"lang": "en"},
    gbm_pipeline_params={"text_features": "tfidf"},
    tfidf_params={
        "svd": True,
        "tfidf_params": {
            "ngram_range": (1, 2),
            "sublinear_tf": True,
            "max_features": 1500,
        },
    },
)

print("splitting...")
df_train, df_test = train_test_split(
    df,
    test_size=0.2,
    shuffle=True,
    random_state=42,
)

print("fitting...")
oof_pred = automl.fit_predict(
    df_train,
    roles={
        "target": "category",
        "text": ["headline", "short_description"],
        "drop": ["authors", "link", "date"]
    }
)

print(oof_pred)

Problem with TabularNLPAutoML

When trying to use TabularNLPAutoML preset there is a problem occurs with DataLoader (RuntimeError: DataLoader worker (pid 2645328) is killed by signal: Segmentation fault.). Full log can be found here - full log.txt.

To replicate I:

Created a fresh conda environment
Installed LAMA using pip install -U lightautoml (automl version 0.2.16, torch 1.8.1, can report any other packages if necessary)
For this issue downloaded a twitter sentiment analysis dataset from kaggle. I think any nlp text can lead to this problem, as long as the text is big enough (I was actually working on completely different dataset)
Run the code (I've put dataset into ./tmp directory):

import pandas as pd
from pathlib import Path

from lightautoml.automl.presets.text_presets import TabularNLPAutoML
from lightautoml.tasks import Task
from lightautoml.report.report_deco import ReportDecoNLP

data_dir = Path('./tmp')
TARGET_NAME = 'label'
THREAD_N = 32
FOLDS = 5
TIMEOUT = 3600
STATE = 42

df = pd.read_csv(data_dir / 'train.csv')

task = Task('binary')
roles = {'target': TARGET_NAME,
         'text': ['tweet'],
         }
RD = ReportDecoNLP()
automl = TabularNLPAutoML(task=task,
                          timeout=TIMEOUT,
                          cpu_limit=THREAD_N,
                          general_params={'use_algos': [['lgb', 'lgb_tuned']]},
                          reader_params={'n_jobs': THREAD_N, 'cv': FOLDS, 'random_state': STATE},
                          gbm_pipeline_params={'text_features': "embed"},
                          text_params={'lang': 'multi'},
                          )
automl_rd = RD(automl)
oof_pred = automl_rd.fit_predict(df, roles=roles)
print('oof_pred:\n{}\nShape = {}'.format(oof_pred, oof_pred.shape))

After running this script I get the error as in the previously mentioned log.txt.

I tried to look at the processes through the debug, but, unfortunately, my knowledge in multiprocessing of your library as well as torch library isn't enough to provide a more thorough explanations. And there seems to be a connection to my particular setup.

My machine:

CPU: Threadripper 2950x
GPU: 2x1080TI

When trying to fix the problem, I found out that setting THREAD_N to lower numbers (1-2 in my) seems to fix it. I also noticed, that the library locates both of my gpus and down the line it, by default, sets the device for the DLTransformer as the first gpu, which then loads both model and texts on the gpu. My guess is that during the multiple worker fetching in DataLoader it tries to load all of the batches on the gpu (which have only 11GB of memory), since this problem doesn't occur on lower amount of threads. One of the fixes that I've come up with is to disable GPU for the the pipeline all together, e.g.:

automl = TabularNLPAutoML(task=task,
                           ....
                           gpu_ids=None
                          )

Is this an intended behavior? What if someone would wanted to use the gpu to speed up their processing?

Inquiring about model saving feature

So I just trained a model on some tabular data using TabularUtilizedAutoML and wanted to save the model. But I couldn't find anything related to saving the best trained model.
Just wanted to know if there is a functionality to do so ?