nyanp / nyaggle
Code for Kaggle and Offline Competitions
License: MIT License
I thought the libraries needed only for testing should be extracted from https://github.com/nyanp/nyaggle/blob/master/requirements-dev.txt. For example, pytest is required only at test time, so I felt it would be better to create a new requirements-test.txt and move such libraries there.
In run_experiment, model_params and fit_params are stored as strings to distinguish them from other parameters, which makes it difficult to compare params across experiments.
It seems better to flatten the dictionaries as in the PR below:
mlflow/mlflow#1863
Current:
{
  "fit_params": "{ early_stopping_rounds: 100 }",
  "model_params": "{ max_depth: 3, objective: \"binary\" }",
  "algorithm_type": "lgbm"
}
Proposed:
{
  "fit_params.early_stopping_rounds": 100,
  "model_params.max_depth": 3,
  "model_params.objective": "binary",
  "algorithm_type": "lgbm"
}
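A minimal sketch of such a flattening helper (the name flatten_params and the "." separator are my own choices, not an mlflow or nyaggle API):

from typing import Any, Dict

def flatten_params(params: Dict[str, Any], parent_key: str = "", sep: str = ".") -> Dict[str, Any]:
    # Flatten nested dicts so each leaf can be logged as a separate param.
    items: Dict[str, Any] = {}
    for key, value in params.items():
        new_key = parent_key + sep + key if parent_key else key
        if isinstance(value, dict):
            items.update(flatten_params(value, new_key, sep=sep))
        else:
            items[new_key] = value
    return items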
nyaggle needs to drop Python 3.5 support because the latest version of xgboost is incompatible with Python 3.5 (it uses f-strings). I guess the change was not intended, but we need to drop Python 3.5 from our CI to avoid continuously failing tests.
This project aims at examples for code competitions rather than an experiment-tracking tool, so I wonder if the repo owner would mind tweaking the topic labels a bit. @nyanp
What content should be included?
The link below is a draft.
https://drive.google.com/open?id=1aSiplVhB9Hjcj8Ib-A9zkAjEgykpllH4
Thanks for publishing such a useful tool!
A few days ago, LightGBM 4.0.0 was released. In this release, the early_stopping_rounds argument of fit() was removed, so functions that use cross_validate(), such as run_experiment, no longer work. (There may be other functions that don't work; I haven't investigated yet.)
Of course, there is no problem with versions up to 3.3.5.
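For reference, LightGBM 4.0's migration path is to pass an early-stopping callback instead of the removed fit() argument; a minimal sketch of the new-style call (not nyaggle's current code):

import lightgbm as lgb
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=256, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=0)

model = LGBMClassifier(n_estimators=300)
# LightGBM >= 4.0: early_stopping_rounds moves into a callback.
model.fit(X_tr, y_tr, eval_set=[(X_va, y_va)],
          callbacks=[lgb.early_stopping(stopping_rounds=100)])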
(nyaggle) yuta100101:~/nyaggle(master =)$ pytest tests/validation/test_cross_validate.py::test_cv_lgbm
========================================================================================== test session starts ===========================================================================================
platform linux -- Python 3.9.17, pytest-7.4.0, pluggy-1.2.0
rootdir: /home/yuta100101/practice/nyaggle
collected 1 item
tests/validation/test_cross_validate.py F [100%]
================================================================================================ FAILURES ================================================================================================
______________________________________________________________________________________________ test_cv_lgbm ______________________________________________________________________________________________
def test_cv_lgbm():
X, y = make_classification(n_samples=1024, n_features=20, class_sep=0.98, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
models = [LGBMClassifier(n_estimators=300) for _ in range(5)]
> pred_oof, pred_test, scores, importance = cross_validate(models, X_train, y_train, X_test, cv=5,
eval_func=roc_auc_score,
fit_params={'early_stopping_rounds': 200})
tests/validation/test_cross_validate.py:52:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
estimator = [LGBMClassifier(n_estimators=300), LGBMClassifier(n_estimators=300), LGBMClassifier(n_estimators=300), LGBMClassifier(n_estimators=300), LGBMClassifier(n_estimators=300)]
X_train = 0 1 2 3 4 5 6 7 8 ... 11 12... ... -0.109782 -0.412230 1.707714 -0.240937 -0.276747 0.481276 -0.278111 1.304773 -0.139538
[512 rows x 20 columns]
y = 0 0
1 0
2 0
3 1
4 0
..
507 0
508 1
509 0
510 1
511 0
Name: target, Length: 512, dtype: int64
X_test = 0 1 2 3 4 5 6 7 8 ... 11 12... ... -2.598922 -0.351561 0.233836 -1.873634 -1.089221 0.373956 -0.520939 -0.489945 2.452996
[512 rows x 20 columns]
cv = KFold(n_splits=5, random_state=0, shuffle=True), groups = None, eval_func = <function roc_auc_score at 0x7fe910196ee0>, logger = <Logger nyaggle.validation.cross_validate (WARNING)>
on_each_fold = None, fit_params = {'early_stopping_rounds': 200}, importance_type = 'gain', early_stopping = True, type_of_target = 'binary'
def cross_validate(estimator: Union[BaseEstimator, List[BaseEstimator]],
X_train: Union[pd.DataFrame, np.ndarray], y: Union[pd.Series, np.ndarray],
X_test: Union[pd.DataFrame, np.ndarray] = None,
cv: Optional[Union[int, Iterable, BaseCrossValidator]] = None,
groups: Optional[pd.Series] = None,
eval_func: Optional[Callable] = None, logger: Optional[Logger] = None,
on_each_fold: Optional[Callable[[int, BaseEstimator, pd.DataFrame, pd.Series], None]] = None,
fit_params: Optional[Union[Dict[str, Any], Callable]] = None,
importance_type: str = 'gain',
early_stopping: bool = True,
type_of_target: str = 'auto') -> CVResult:
"""
Evaluate metrics by cross-validation. It also records out-of-fold prediction and test prediction.
Args:
estimator:
The object to be used in cross-validation. For list inputs, ``estimator[i]`` is trained on i-th fold.
X_train:
Training data
y:
Target
X_test:
Test data (Optional). If specified, prediction on the test data is performed using ensemble of models.
cv:
int, cross-validation generator or an iterable which determines the cross-validation splitting strategy.
- None, to use the default ``KFold(5, random_state=0, shuffle=True)``,
- integer, to specify the number of folds in a ``(Stratified)KFold``,
- CV splitter (the instance of ``BaseCrossValidator``),
- An iterable yielding (train, test) splits as arrays of indices.
groups:
Group labels for the samples. Only used in conjunction with a “Group” cv instance (e.g., ``GroupKFold``).
eval_func:
Function used for logging and returning scores
logger:
logger
on_each_fold:
called for each fold with (idx_fold, model, X_fold, y_fold)
fit_params:
Parameters passed to the fit method of the estimator
importance_type:
The type of feature importance to be used to calculate result.
Used only in ``LGBMClassifier`` and ``LGBMRegressor``.
early_stopping:
If ``True``, ``eval_set`` will be added to ``fit_params`` for each fold.
``early_stopping_rounds = 100`` will also be appended to fit_params if it does not already have one.
type_of_target:
The type of target variable. If ``auto``, type is inferred by ``sklearn.utils.multiclass.type_of_target``.
Otherwise, ``binary``, ``continuous``, or ``multiclass`` are supported.
Returns:
Namedtuple with following members
* oof_prediction (numpy array, shape (len(X_train),)):
The predicted value on out-of-fold validation data.
* test_prediction (numpy array, shape (len(X_test),)):
The predicted value on test data. ``None`` if X_test is ``None``.
* scores (list of float, shape (nfolds+1,)):
``scores[i]`` denotes validation score in i-th fold.
``scores[-1]`` is the overall score. `None` if eval is not specified.
* importance (list of pandas DataFrame, shape (nfolds,)):
``importance[i]`` denotes feature importance in i-th fold model.
If the estimator is not GBDT, empty array is returned.
Example:
>>> from sklearn.datasets import make_regression
>>> from sklearn.linear_model import Ridge
>>> from sklearn.metrics import mean_squared_error
>>> from nyaggle.validation import cross_validate
>>> X, y = make_regression(n_samples=8)
>>> model = Ridge(alpha=1.0)
>>> pred_oof, pred_test, scores, _ = \
>>> cross_validate(model,
>>> X_train=X[:3, :],
>>> y=y[:3],
>>> X_test=X[3:, :],
>>> cv=3,
>>> eval_func=mean_squared_error)
>>> print(pred_oof)
[-101.1123267 , 26.79300693, 17.72635528]
>>> print(pred_test)
[-10.65095894 -12.18909059 -23.09906427 -17.68360714 -20.08218267]
>>> print(scores)
[71912.80290003832, 15236.680239881942, 15472.822033121925, 34207.43505768073]
"""
cv = check_cv(cv, y)
n_output_cols = 1
if type_of_target == 'auto':
type_of_target = multiclass.type_of_target(y)
if type_of_target == 'multiclass':
n_output_cols = y.nunique(dropna=True)
if isinstance(estimator, list):
assert len(estimator) == cv.get_n_splits(), "Number of estimators should be same to nfolds."
X_train = convert_input(X_train)
y = convert_input_vector(y, X_train.index)
if X_test is not None:
X_test = convert_input(X_test)
if not isinstance(estimator, list):
estimator = [estimator] * cv.get_n_splits()
assert len(estimator) == cv.get_n_splits()
if logger is None:
logger = getLogger(__name__)
def _predict(model: BaseEstimator, x: pd.DataFrame, _type_of_target: str):
if _type_of_target in ('binary', 'multiclass'):
if hasattr(model, "predict_proba"):
proba = model.predict_proba(x)
elif hasattr(model, "decision_function"):
warnings.warn('Since {} does not have predict_proba method, '
'decision_function is used for the prediction instead.'.format(type(model)))
proba = model.decision_function(x)
else:
raise RuntimeError('Estimator in classification problem should have '
'either predict_proba or decision_function')
if proba.ndim == 1:
return proba
else:
return proba[:, 1] if proba.shape[1] == 2 else proba
else:
return model.predict(x)
oof = np.zeros((len(X_train), n_output_cols)) if n_output_cols > 1 else np.zeros(len(X_train))
evaluated = np.full(len(X_train), False)
test = None
if X_test is not None:
test = np.zeros((len(X_test), n_output_cols)) if n_output_cols > 1 else np.zeros(len(X_test))
scores = []
eta_all = []
importance = []
for n, (train_idx, valid_idx) in enumerate(cv.split(X_train, y, groups)):
start_time = time.time()
train_x, train_y = X_train.iloc[train_idx], y.iloc[train_idx]
valid_x, valid_y = X_train.iloc[valid_idx], y.iloc[valid_idx]
if fit_params is None:
fit_params_fold = {}
elif callable(fit_params):
fit_params_fold = fit_params(n, train_idx, valid_idx)
else:
fit_params_fold = copy.copy(fit_params)
if is_gbdt_instance(estimator[n], ('lgbm', 'cat', 'xgb')):
if early_stopping:
if 'eval_set' not in fit_params_fold:
fit_params_fold['eval_set'] = [(valid_x, valid_y)]
if 'early_stopping_rounds' not in fit_params_fold:
fit_params_fold['early_stopping_rounds'] = 100
> estimator[n].fit(train_x, train_y, **fit_params_fold)
E TypeError: fit() got an unexpected keyword argument 'early_stopping_rounds'
nyaggle/validation/cross_validate.py:177: TypeError
======================================================================================== short test summary info =========================================================================================
FAILED tests/validation/test_cross_validate.py::test_cv_lgbm - TypeError: fit() got an unexpected keyword argument 'early_stopping_rounds'
=========================================================================================== 1 failed in 1.90s ============================================================================================
labels is only depended on by StratifiedGroupKFold. I think I can replace it with pandas.Series.groupby.
TimeSeriesSplit in sklearn does not fit most Kaggle datasets (it assumes fixed time intervals). We need a scikit-learn compatible, practical CV splitter for general time series problems.
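As an illustration of what such a splitter could look like, here is a minimal sketch of a scikit-learn-compatible splitter driven by explicit timestamp boundaries rather than fixed intervals (the class and its constructor are hypothetical, not an existing nyaggle API):

import numpy as np
import pandas as pd

class BoundaryTimeSeriesSplit:
    # Each fold is a (train_end, test_start, test_end) tuple of timestamps.
    def __init__(self, times: pd.Series, folds):
        self.times = pd.to_datetime(times)
        self.folds = folds

    def get_n_splits(self, X=None, y=None, groups=None):
        return len(self.folds)

    def split(self, X=None, y=None, groups=None):
        for train_end, test_start, test_end in self.folds:
            train_idx = np.where(self.times < pd.Timestamp(train_end))[0]
            test_idx = np.where((self.times >= pd.Timestamp(test_start)) &
                                (self.times < pd.Timestamp(test_end)))[0]
            yield train_idx, test_idx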
experiment_gbdt can be a general high-level experiment API with few changes.
mlflow raises an error if the length of a key/value exceeds 250 characters. If the GBDT parameters or cat_columns are long, experiment_gbdt will raise an exception.
Possible option:
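One way the limit could be worked around (a sketch of my own using only the public mlflow.log_param API; the truncation policy is an assumption, not nyaggle behavior):

import mlflow

MAX_PARAM_VALUE_LENGTH = 250  # mlflow rejects longer key/value strings

def log_param_safely(key: str, value) -> None:
    # Truncate the stringified value so mlflow.log_param does not raise.
    text = str(value)
    if len(text) > MAX_PARAM_VALUE_LENGTH:
        text = text[:MAX_PARAM_VALUE_LENGTH - 3] + "..."
    mlflow.log_param(key, text)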
nyaggle's TargetEncoder is basically a KFold version of category_encoders', with the same interface.
category_encoders changed its default parameters in scikit-learn-contrib/category_encoders#327, so it would be better to apply the same change to nyaggle. (This also causes some CI tests to fail.)
Personally, I don't think the new default parameters are always good, but as long as nyaggle's target encoder implementation is a thin wrapper around category_encoders, I think interface consistency should be the priority.
nyaggle/nyaggle/validation/split.py
Line 69 in d5dd9cf
Should this split be folds?
Wouldn't it be better to match the Python versions supported by the underlying libraries (XGBoost/LightGBM)?
Breaking: XGBoost Python package now requires Python 3.6 and later (#5715)
This code records the score of the target variable transformed by np.log1p:
import numpy as np
from sklearn.model_selection import KFold
from nyaggle.experiment import run_experiment

# load_dataset, make_sample_submission and rmse are user-defined helpers.
train, test = load_dataset()
target_col = "y"
submit = make_sample_submission(test, target_col)
target = train[target_col]
target = target.map(np.log1p)
train.drop(columns=[target_col], inplace=True)
lightgbm_params = {
"metric": "rmse",
"objective": 'regression',
"max_depth": 5,
"num_leaves": 24,
"learning_rate": 0.007,
"n_estimators": 30000,
"min_child_samples": 80,
"subsample": 0.8,
"colsample_bytree": 1,
"reg_alpha": 0,
"reg_lambda": 0,
}
fit_params = {
"early_stopping_rounds": 100,
"verbose": 5000
}
kf = KFold(n_splits=4)
lgb_result = run_experiment(lightgbm_params,
X_train=train,
y=target,
X_test=test,
eval_func=rmse,
cv=kf,
fit_params=fit_params,
logging_directory='resources/logs/'
'lightgbm/{time}',
sample_submission=submit)
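Note that since the experiment runs on the log1p-transformed target, the recorded RMSE is in log space; if the submission needs raw-scale values, the test predictions must be mapped back. A sketch, assuming the result exposes test predictions as test_prediction the way cross_validate does:

# Map predictions back to the original scale before writing the submission.
submit[target_col] = np.expm1(lgb_result.test_prediction)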
Unlike the current implementation of cv in nyaggle, the models trained by lgb.cv and xgb.cv have an equal number of trees in all folds.
Since these "balanced" models may work better when the dataset is small, we sometimes want to extract the trained models from lgb.cv or xgb.cv and use them on test data.
So it would be useful to have the option to use these cv functions in nyaggle's run_experiment and cross_validate as well.
ref:
https://blog.amedama.jp/entry/lightgbm-cv-model
https://blog.amedama.jp/entry/xgboost-cv-model
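For reference, recent LightGBM versions can already hand back the per-fold boosters trained inside lgb.cv, which is the approach the linked posts describe; a minimal sketch:

import lightgbm as lgb
import numpy as np
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=512, random_state=0)
train_set = lgb.Dataset(X, label=y)

# return_cvbooster=True (LightGBM >= 3.0) exposes the trained fold models.
result = lgb.cv({"objective": "binary"}, train_set, num_boost_round=100,
                nfold=5, return_cvbooster=True)
cvbooster = result["cvbooster"]

# All fold models have the same number of trees; average them on test data.
pred_test = np.mean(cvbooster.predict(X), axis=0)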
This is not an issue. I found that you use get_temp_directory in the tests to create a temporary directory. Is there any reason you don't use the tmpdir fixture in pytest?
https://docs.pytest.org/en/latest/tmpdir.html#the-tmpdir-fixture
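For comparison, the tmpdir fixture is injected as a test argument and pytest creates and cleans up a unique directory per test; a minimal example:

def test_write_artifact(tmpdir):
    # tmpdir is a py.path.local object managed by pytest.
    path = tmpdir.join("result.txt")
    path.write("ok")
    assert path.read() == "ok"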
Python 3.9.10 (tags/v3.9.10:f2f3f53, Jan 17 2022, 15:14:21) [MSC v.1929 64 bit (AMD64)] on win32
Running pytest produced the following error messages:
> pytest
========================================================================= short test summary info =========================================================================
FAILED tests/feature/test_groupby.py::test_return_type_by_aggregation - ValueError: Supported types are: <class 'str'> or typing.Callable. Got <class 'numpy._ArrayFunctionDispatcher'> instead.
FAILED tests/feature/nlp/test_bert.py::test_bert_jp - requests.exceptions.ConnectionError: HTTPSConnectionPool(host='cdn-lfs.huggingface.co', port=443): Read timed out.
======================================================== 2 failed, 106 passed, 1084 warnings in 332.27s (0:05:32) =========================================================
_____________________________________________________________________ test_return_type_by_aggregation _____________________________________________________________________
iris_dataframe = ( sl sw pl pw species
0 5.1 3.5 1.4 0.2 0.0
1 4.9 3.0 1.4 0.2 0.0
2 4.7 3.2 1.3... 3.4 5.4 2.3 2.0
149 5.9 3.0 5.1 1.8 2.0
[150 rows x 5 columns], 'species', ['sl', 'sw', 'pl', 'pw'])
def test_return_type_by_aggregation(iris_dataframe):
df, group_key, group_values = iris_dataframe
agg_methods = ["max", np.sum, custom_function]
new_df, new_cols = aggregation(df, group_key, group_values,
agg_methods)
tests\feature\test_groupby.py:27:
input_df = sl sw pl pw species
0 5.1 3.5 1.4 0.2 0.0
1 4.9 3.0 1.4 0.2 0.0
2 4.7 3.2 1.3 ... 6.5 3.0 5.2 2.0 2.0
148 6.2 3.4 5.4 2.3 2.0
149 5.9 3.0 5.1 1.8 2.0
[150 rows x 5 columns]
group_key = 'species', group_values = ['sl', 'sw', 'pl', 'pw']
agg_methods = ['max', <function sum at 0x000002226B6C2CF0>, <function custom_function at 0x000002221318C280>]
def aggregation(
input_df: pd.DataFrame,
group_key: str,
group_values: List[str],
agg_methods: List[Union[str, FunctionType]],
) -> Tuple[pd.DataFrame, List[str]]:
"""
Aggregate values after grouping table rows by a given key.
Args:
input_df:
Input data frame.
group_key:
Used to determine the groups for the groupby.
group_values:
Used to aggregate values for the groupby.
agg_methods:
List of function or function names, e.g. ['mean', 'max', 'min', numpy.mean].
Do not use a lambda function because the name attribute of the lambda function cannot generate a unique string of column names in <lambda>.
Returns:
Tuple of output dataframe and new column names.
"""
new_df = input_df.copy()
new_cols = []
for agg_method in agg_methods:
if _is_lambda_function(agg_method):
raise ValueError('Not supported lambda function.')
elif isinstance(agg_method, str):
pass
elif isinstance(agg_method, FunctionType):
pass
else:
raise ValueError('Supported types are: {} or {}.'
' Got {} instead.'.format(str, Callable, type(agg_method)))
E ValueError: Supported types are: <class 'str'> or typing.Callable. Got <class 'numpy._ArrayFunctionDispatcher'> instead.
nyaggle\feature\groupby.py:89: ValueError
In the test code, numpy.sum is passed as one of the values of aggregation's agg_methods argument.
nyaggle/tests/feature/test_groupby.py
Lines 24 to 30 in 44b0169
Only the following three types are accepted for aggregation's agg_methods argument, and since numpy.sum's class is <class 'numpy._ArrayFunctionDispatcher'>, it is rejected by the if statement.
nyaggle/nyaggle/feature/groupby.py
Lines 81 to 90 in 44b0169
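One possible fix (my own sketch, not a committed patch) is to accept any non-lambda callable instead of requiring FunctionType, since NumPy >= 1.25 makes np.sum a numpy._ArrayFunctionDispatcher rather than a plain function:

from typing import Callable

# Sketch of a replacement for the type check in aggregation();
# _is_lambda_function and agg_methods come from the quoted source above.
for agg_method in agg_methods:
    if _is_lambda_function(agg_method):
        raise ValueError('Not supported lambda function.')
    elif isinstance(agg_method, str) or callable(agg_method):
        pass  # np.sum and other non-FunctionType callables are now accepted
    else:
        raise ValueError('Supported types are: {} or {}.'
                         ' Got {} instead.'.format(str, Callable, type(agg_method)))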
AttributeError occurs when a TargetEncoder object is passed to the str() function.
This behavior has the potential to become an issue when debugging, etc.
The steps to reproduce are the following.
>>> from nyaggle.feature.category_encoder.target_encoder import TargetEncoder
>>> encoder = TargetEncoder()
>>> str(encoder)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/amedama/.virtualenvs/py310/lib/python3.10/site-packages/sklearn/base.py", line 279, in __repr__
repr_ = pp.pformat(self)
File "/usr/local/Cellar/[email protected]/3.10.9/Frameworks/Python.framework/Versions/3.10/lib/python3.10/pprint.py", line 157, in pformat
self._format(object, sio, 0, 0, {}, 0)
File "/usr/local/Cellar/[email protected]/3.10.9/Frameworks/Python.framework/Versions/3.10/lib/python3.10/pprint.py", line 174, in _format
rep = self._repr(object, context, level)
File "/usr/local/Cellar/[email protected]/3.10.9/Frameworks/Python.framework/Versions/3.10/lib/python3.10/pprint.py", line 454, in _repr
repr, readable, recursive = self.format(object, context.copy(),
File "/Users/amedama/.virtualenvs/py310/lib/python3.10/site-packages/sklearn/utils/_pprint.py", line 189, in format
return _safe_repr(
File "/Users/amedama/.virtualenvs/py310/lib/python3.10/site-packages/sklearn/utils/_pprint.py", line 440, in _safe_repr
params = _changed_params(object)
File "/Users/amedama/.virtualenvs/py310/lib/python3.10/site-packages/sklearn/utils/_pprint.py", line 93, in _changed_params
params = estimator.get_params(deep=False)
File "/Users/amedama/.virtualenvs/py310/lib/python3.10/site-packages/sklearn/base.py", line 211, in get_params
value = getattr(self, key)
AttributeError: 'TargetEncoder' object has no attribute 'cols'
TargetEncoder takes a 'cols' parameter but doesn't save it to an attribute (it is transferred to category_encoders.TargetEncoder internally). But scikit-learn's BaseEstimator#get_params() expects that all __init__() parameters are saved, so accessing 'cols' raises AttributeError.
$ python -V
Python 3.10.9
$ pip list | egrep -i "(nyaggle|scikit-learn)"
nyaggle 0.1.5
scikit-learn 1.2.0
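The scikit-learn convention is that __init__() stores every argument verbatim on self and defers any work to fit(); a minimal sketch of a wrapper that follows it (assuming the wrapper currently forwards cols without saving it; only the cols parameter is shown):

import category_encoders as ce
from sklearn.base import BaseEstimator, TransformerMixin

class TargetEncoder(BaseEstimator, TransformerMixin):
    def __init__(self, cols=None):
        self.cols = cols  # saved as-is so get_params()/repr() can read it

    def fit(self, X, y):
        # Build the wrapped encoder lazily in fit(), not in __init__().
        self.encoder_ = ce.TargetEncoder(cols=self.cols)
        self.encoder_.fit(X, y)
        return self

    def transform(self, X):
        return self.encoder_.transform(X)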
Hello, this is @wakame1367.
Currently, only the repository owner seems to have permission to merge PRs. If @nyanp alone holds merge permission, I think the burden on @nyanp becomes large. As a solution, granting that permission to contributors as well would reduce the burden and improve development efficiency.
I am also concerned that some PRs have been left open for a long time, for example the following PR:
#96
So here is a proposal: how about adding trusted contributors as Collaborators of the repository? Contributors added as Collaborators gain additional permissions, including permission to merge PRs.
The following is how to add a Collaborator, for reference.
To give another user permission to merge Pull Requests (PRs) on GitHub, you need to add that user as a "Collaborator" or "Team Member" of the repository. The steps are as follows:
Add as a Collaborator
The CI in #96 is failing on Python 3.6. Python 3.6 ~ 3.7 should be dropped from support and removed from CI, as they are already EOL.
It would be nice if run_experiment accepted a dictionary to customize the layout of logging artifacts.
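For illustration, the customization could look something like the dictionary below (a purely hypothetical shape; none of these keys or paths exist in nyaggle today):

# Hypothetical: every key and filename here is illustrative only.
artifact_layout = {
    "oof_prediction": "predictions/oof.npy",
    "test_prediction": "predictions/test.npy",
    "feature_importance": "importance/importance.csv",
    "metrics": "metrics.json",
}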
Collecting hyperparameters used in top Kaggle solutions and providing an API to use them seems like a good starting point for a new competition.
I often use preemptible instances, so I want to resume training if an instance is shut down.
Can nyaggle resume training? If not now, is nyaggle planning to implement it?
Support simple stacked generalization from multiple experiments.
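A minimal sketch of what that could mean in practice: use the out-of-fold predictions saved by several experiments as features for a second-level model (the file paths are assumptions, not an existing layout):

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Assumed inputs: OOF predictions and target saved by previous experiments.
oof_a = np.load("logs/exp_a/oof.npy")
oof_b = np.load("logs/exp_b/oof.npy")
y = np.load("logs/y.npy")

# Stack first-level predictions column-wise and evaluate a meta-model.
X_meta = np.column_stack([oof_a, oof_b])
print(cross_val_score(Ridge(), X_meta, y, cv=5).mean())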
A number of flake8 errors were logged during the CI check. While not fatal, it is desirable to fix them.
nyaggle should not automatically install xgboost, lightgbm, and catboost into the user's environment.
see also: shap.
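One common way to do this (the pattern shap itself uses) is to move them into extras_require; a hypothetical setup.py fragment (extra names are my own choices):

from setuptools import setup, find_packages

setup(
    name="nyaggle",
    packages=find_packages(),
    install_requires=["pandas", "scikit-learn"],  # core dependencies only
    extras_require={
        # GBDT backends become opt-in: pip install nyaggle[all]
        "all": ["lightgbm", "xgboost", "catboost"],
    },
)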
I think it would be good to write a CONTRIBUTING.md that specifies, for PRs and issues to this repository, things such as the format in which bug reports should be submitted and how to set up a test environment when submitting a PR for a bug you found. I think the content would include items like the following.