nyanp / nyaggle
Code for Kaggle and Offline Competitions
License: MIT License
I thought the libraries needed only for testing should be extracted from https://github.com/nyanp/nyaggle/blob/master/requirements-dev.txt. For example, pytest is required only at test time, so I felt it would be better to create a new requirements-test.txt and move such libraries there.
In run_experiment, model_params and fit_params are stored as strings to distinguish them from other parameters, which makes it difficult to compare params across experiments.
It seems better to flatten the dictionaries as in the PR below:
mlflow/mlflow#1863
Current:
{
  "fit_params": "{ early_stopping_rounds: 100 }",
  "model_params": "{ max_depth: 3, objective: \"binary\" }",
  "algorithm_type": "lgbm"
}
Proposed:
{
  "fit_params.early_stopping_rounds": 100,
  "model_params.max_depth": 3,
  "model_params.objective": "binary",
  "algorithm_type": "lgbm"
}
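A minimal sketch of such a flattening helper (the name flatten_params and the "." separator are my own choices, not an mlflow or nyaggle API):

from typing import Any, Dict

def flatten_params(params: Dict[str, Any], parent_key: str = "", sep: str = ".") -> Dict[str, Any]:
    # Flatten nested dicts so each leaf can be logged as a separate param.
    items: Dict[str, Any] = {}
    for key, value in params.items():
        new_key = parent_key + sep + key if parent_key else key
        if isinstance(value, dict):
            items.update(flatten_params(value, new_key, sep=sep))
        else:
            items[new_key] = value
    return items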
nyaggle needs to drop Python 3.5 support because the latest version of xgboost is incompatible with Python 3.5 (it uses f-strings). I guess the change was not intended, but we need to drop Python 3.5 from our CI to avoid continuously failing tests.
This project aims at examples for code competitions rather than an experiment-tracking tool, so I wonder if the repo owner would mind tweaking the topic labels a bit. @nyanp
What content should be included?
The link below is a draft.
https://drive.google.com/open?id=1aSiplVhB9Hjcj8Ib-A9zkAjEgykpllH4
Thanks for publishing such a useful tool!
A few days ago, LightGBM 4.0.0 was released. In this release, the early_stopping_rounds argument of fit() was removed, so functions that use cross_validate(), such as run_experiment, no longer work. (There may be other functions that don't work; I haven't investigated yet.)
Of course, there is no problem with versions up to 3.3.5.
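For reference, LightGBM 4.0's migration path is to pass an early-stopping callback instead of the removed fit() argument; a minimal sketch of the new-style call (not nyaggle's current code):

import lightgbm as lgb
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=256, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=0)

model = LGBMClassifier(n_estimators=300)
# LightGBM >= 4.0: early_stopping_rounds moves into a callback.
model.fit(X_tr, y_tr, eval_set=[(X_va, y_va)],
          callbacks=[lgb.early_stopping(stopping_rounds=100)])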
(nyaggle) yuta100101:~/nyaggle(master =)$ pytest tests/validation/test_cross_validate.py::test_cv_lgbm
========================================================================================== test session starts ===========================================================================================
platform linux -- Python 3.9.17, pytest-7.4.0, pluggy-1.2.0
rootdir: /home/yuta100101/practice/nyaggle
collected 1 item
tests/validation/test_cross_validate.py F [100%]
================================================================================================ FAILURES ================================================================================================
______________________________________________________________________________________________ test_cv_lgbm ______________________________________________________________________________________________
def test_cv_lgbm():
X, y = make_classification(n_samples=1024, n_features=20, class_sep=0.98, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
models = [LGBMClassifier(n_estimators=300) for _ in range(5)]
> pred_oof, pred_test, scores, importance = cross_validate(models, X_train, y_train, X_test, cv=5,
eval_func=roc_auc_score,
fit_params={'early_stopping_rounds': 200})
tests/validation/test_cross_validate.py:52:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
estimator = [LGBMClassifier(n_estimators=300), LGBMClassifier(n_estimators=300), LGBMClassifier(n_estimators=300), LGBMClassifier(n_estimators=300), LGBMClassifier(n_estimators=300)]
X_train = 0 1 2 3 4 5 6 7 8 ... 11 12... ... -0.109782 -0.412230 1.707714 -0.240937 -0.276747 0.481276 -0.278111 1.304773 -0.139538
[512 rows x 20 columns]
y = 0 0
1 0
2 0
3 1
4 0
..
507 0
508 1
509 0
510 1
511 0
Name: target, Length: 512, dtype: int64
X_test = 0 1 2 3 4 5 6 7 8 ... 11 12... ... -2.598922 -0.351561 0.233836 -1.873634 -1.089221 0.373956 -0.520939 -0.489945 2.452996
[512 rows x 20 columns]
cv = KFold(n_splits=5, random_state=0, shuffle=True), groups = None, eval_func = <function roc_auc_score at 0x7fe910196ee0>, logger = <Logger nyaggle.validation.cross_validate (WARNING)>
on_each_fold = None, fit_params = {'early_stopping_rounds': 200}, importance_type = 'gain', early_stopping = True, type_of_target = 'binary'
def cross_validate(estimator: Union[BaseEstimator, List[BaseEstimator]],
X_train: Union[pd.DataFrame, np.ndarray], y: Union[pd.Series, np.ndarray],
X_test: Union[pd.DataFrame, np.ndarray] = None,
cv: Optional[Union[int, Iterable, BaseCrossValidator]] = None,
groups: Optional[pd.Series] = None,
eval_func: Optional[Callable] = None, logger: Optional[Logger] = None,
on_each_fold: Optional[Callable[[int, BaseEstimator, pd.DataFrame, pd.Series], None]] = None,
fit_params: Optional[Union[Dict[str, Any], Callable]] = None,
importance_type: str = 'gain',
early_stopping: bool = True,
type_of_target: str = 'auto') -> CVResult:
"""
Evaluate metrics by cross-validation. It also records out-of-fold prediction and test prediction.
Args:
estimator:
The object to be used in cross-validation. For list inputs, ``estimator[i]`` is trained on i-th fold.
X_train:
Training data
y:
Target
X_test:
Test data (Optional). If specified, prediction on the test data is performed using ensemble of models.
cv:
int, cross-validation generator or an iterable which determines the cross-validation splitting strategy.
- None, to use the default ``KFold(5, random_state=0, shuffle=True)``,
- integer, to specify the number of folds in a ``(Stratified)KFold``,
- CV splitter (the instance of ``BaseCrossValidator``),
- An iterable yielding (train, test) splits as arrays of indices.
groups:
Group labels for the samples. Only used in conjunction with a “Group” cv instance (e.g., ``GroupKFold``).
eval_func:
Function used for logging and returning scores
logger:
logger
on_each_fold:
called for each fold with (idx_fold, model, X_fold, y_fold)
fit_params:
Parameters passed to the fit method of the estimator
importance_type:
The type of feature importance to be used to calculate result.
Used only in ``LGBMClassifier`` and ``LGBMRegressor``.
early_stopping:
If ``True``, ``eval_set`` will be added to ``fit_params`` for each fold.
``early_stopping_rounds = 100`` will also be appended to fit_params if it does not already have one.
type_of_target:
The type of target variable. If ``auto``, type is inferred by ``sklearn.utils.multiclass.type_of_target``.
Otherwise, ``binary``, ``continuous``, or ``multiclass`` are supported.
Returns:
Namedtuple with following members
* oof_prediction (numpy array, shape (len(X_train),)):
The predicted value on out-of-fold validation data.
* test_prediction (numpy array, shape (len(X_test),)):
The predicted value on test data. ``None`` if X_test is ``None``.
* scores (list of float, shape (nfolds+1,)):
``scores[i]`` denotes validation score in i-th fold.
``scores[-1]`` is the overall score. `None` if eval is not specified.
* importance (list of pandas DataFrame, shape (nfolds,)):
``importance[i]`` denotes feature importance in i-th fold model.
If the estimator is not GBDT, empty array is returned.
Example:
>>> from sklearn.datasets import make_regression
>>> from sklearn.linear_model import Ridge
>>> from sklearn.metrics import mean_squared_error
>>> from nyaggle.validation import cross_validate
>>> X, y = make_regression(n_samples=8)
>>> model = Ridge(alpha=1.0)
>>> pred_oof, pred_test, scores, _ = \
>>> cross_validate(model,
>>> X_train=X[:3, :],
>>> y=y[:3],
>>> X_test=X[3:, :],
>>> cv=3,
>>> eval_func=mean_squared_error)
>>> print(pred_oof)
[-101.1123267 , 26.79300693, 17.72635528]
>>> print(pred_test)
[-10.65095894 -12.18909059 -23.09906427 -17.68360714 -20.08218267]
>>> print(scores)
[71912.80290003832, 15236.680239881942, 15472.822033121925, 34207.43505768073]
"""
cv = check_cv(cv, y)
n_output_cols = 1
if type_of_target == 'auto':
type_of_target = multiclass.type_of_target(y)
if type_of_target == 'multiclass':
n_output_cols = y.nunique(dropna=True)
if isinstance(estimator, list):
assert len(estimator) == cv.get_n_splits(), "Number of estimators should be same to nfolds."
X_train = convert_input(X_train)
y = convert_input_vector(y, X_train.index)
if X_test is not None:
X_test = convert_input(X_test)
if not isinstance(estimator, list):
estimator = [estimator] * cv.get_n_splits()
assert len(estimator) == cv.get_n_splits()
if logger is None:
logger = getLogger(__name__)
def _predict(model: BaseEstimator, x: pd.DataFrame, _type_of_target: str):
if _type_of_target in ('binary', 'multiclass'):
if hasattr(model, "predict_proba"):
proba = model.predict_proba(x)
elif hasattr(model, "decision_function"):
warnings.warn('Since {} does not have predict_proba method, '
'decision_function is used for the prediction instead.'.format(type(model)))
proba = model.decision_function(x)
else:
raise RuntimeError('Estimator in classification problem should have '
'either predict_proba or decision_function')
if proba.ndim == 1:
return proba
else:
return proba[:, 1] if proba.shape[1] == 2 else proba
else:
return model.predict(x)
oof = np.zeros((len(X_train), n_output_cols)) if n_output_cols > 1 else np.zeros(len(X_train))
evaluated = np.full(len(X_train), False)
test = None
if X_test is not None:
test = np.zeros((len(X_test), n_output_cols)) if n_output_cols > 1 else np.zeros(len(X_test))
scores = []
eta_all = []
importance = []
for n, (train_idx, valid_idx) in enumerate(cv.split(X_train, y, groups)):
start_time = time.time()
train_x, train_y = X_train.iloc[train_idx], y.iloc[train_idx]
valid_x, valid_y = X_train.iloc[valid_idx], y.iloc[valid_idx]
if fit_params is None:
fit_params_fold = {}
elif callable(fit_params):
fit_params_fold = fit_params(n, train_idx, valid_idx)
else:
fit_params_fold = copy.copy(fit_params)
if is_gbdt_instance(estimator[n], ('lgbm', 'cat', 'xgb')):
if early_stopping:
if 'eval_set' not in fit_params_fold:
fit_params_fold['eval_set'] = [(valid_x, valid_y)]
if 'early_stopping_rounds' not in fit_params_fold:
fit_params_fold['early_stopping_rounds'] = 100
> estimator[n].fit(train_x, train_y, **fit_params_fold)
E TypeError: fit() got an unexpected keyword argument 'early_stopping_rounds'
nyaggle/validation/cross_validate.py:177: TypeError
======================================================================================== short test summary info =========================================================================================
FAILED tests/validation/test_cross_validate.py::test_cv_lgbm - TypeError: fit() got an unexpected keyword argument 'early_stopping_rounds'
=========================================================================================== 1 failed in 1.90s ============================================================================================
labels is only depended on by StratifiedGroupKFold. I think I can replace it with pandas.Series.groupby.
TimeSeriesSplit in sklearn does not fit most Kaggle datasets (it assumes fixed time intervals). We need a scikit-learn compatible, practical CV splitter for general time series problems.
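As an illustration of what such a splitter could look like, here is a minimal sketch of a scikit-learn-compatible splitter driven by explicit timestamp boundaries rather than fixed intervals (the class and its constructor are hypothetical, not an existing nyaggle API):

import numpy as np
import pandas as pd

class BoundaryTimeSeriesSplit:
    # Each fold is a (train_end, test_start, test_end) tuple of timestamps.
    def __init__(self, times: pd.Series, folds):
        self.times = pd.to_datetime(times)
        self.folds = folds

    def get_n_splits(self, X=None, y=None, groups=None):
        return len(self.folds)

    def split(self, X=None, y=None, groups=None):
        for train_end, test_start, test_end in self.folds:
            train_idx = np.where(self.times < pd.Timestamp(train_end))[0]
            test_idx = np.where((self.times >= pd.Timestamp(test_start)) &
                                (self.times < pd.Timestamp(test_end)))[0]
            yield train_idx, test_idx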
experiment_gbdt can be a general high-level experiment API with few changes.
mlflow raises an error if the length of a key/value exceeds 250 characters. If the GBDT parameters or cat_columns are long, experiment_gbdt will raise an exception.
Possible option:
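One way the limit could be worked around (a sketch of my own using only the public mlflow.log_param API; the truncation policy is an assumption, not nyaggle behavior):

import mlflow

MAX_PARAM_VALUE_LENGTH = 250  # mlflow rejects longer key/value strings

def log_param_safely(key: str, value) -> None:
    # Truncate the stringified value so mlflow.log_param does not raise.
    text = str(value)
    if len(text) > MAX_PARAM_VALUE_LENGTH:
        text = text[:MAX_PARAM_VALUE_LENGTH - 3] + "..."
    mlflow.log_param(key, text)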
nyaggle's TargetEncoder is basically a KFold version of category_encoders', with the same interface.
category_encoders changed its default parameters in scikit-learn-contrib/category_encoders#327, so it would be better to apply the same change to nyaggle. (This also causes some CI tests to fail.)
Personally, I don't think the new default parameters are always good, but as long as nyaggle's target encoder implementation is a thin wrapper around category_encoders, I think interface consistency should be the priority.
nyaggle/nyaggle/validation/split.py
Line 69 in d5dd9cf
Should this split be folds?
Wouldn't it be better to match the Python versions supported by the underlying libraries (XGBoost/LightGBM)?
Breaking: XGBoost Python package now requires Python 3.6 and later (#5715)
This code records the score of the target variable transformed by np.log1p:
import numpy as np
from sklearn.model_selection import KFold
from nyaggle.experiment import run_experiment

# load_dataset, make_sample_submission and rmse are user-defined helpers.
train, test = load_dataset()
target_col = "y"
submit = make_sample_submission(test, target_col)
target = train[target_col]
target = target.map(np.log1p)
train.drop(columns=[target_col], inplace=True)
lightgbm_params = {
"metric": "rmse",
"objective": 'regression',
"max_depth": 5,
"num_leaves": 24,
"learning_rate": 0.007,
"n_estimators": 30000,
"min_child_samples": 80,
"subsample": 0.8,
"colsample_bytree": 1,
"reg_alpha": 0,
"reg_lambda": 0,
}
fit_params = {
"early_stopping_rounds": 100,
"verbose": 5000
}
kf = KFold(n_splits=4)
lgb_result = run_experiment(lightgbm_params,
X_train=train,
y=target,
X_test=test,
eval_func=rmse,
cv=kf,
fit_params=fit_params,
logging_directory='resources/logs/'
'lightgbm/{time}',
sample_submission=submit)
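Note that since the experiment runs on the log1p-transformed target, the recorded RMSE is in log space; if the submission needs raw-scale values, the test predictions must be mapped back. A sketch, assuming the result exposes test predictions as test_prediction the way cross_validate does:

# Map predictions back to the original scale before writing the submission.
submit[target_col] = np.expm1(lgb_result.test_prediction)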
Unlike the current implementation of cv in nyaggle, the models trained by lgb.cv and xgb.cv have an equal number of trees in all folds.
Since these "balanced" models may work better when the dataset is small, we sometimes want to extract the trained models from lgb.cv or xgb.cv and use them on test data.
So it would be useful to have the option to use these cv functions in nyaggle's run_experiment and cross_validate as well.
ref:
https://blog.amedama.jp/entry/lightgbm-cv-model
https://blog.amedama.jp/entry/xgboost-cv-model
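For reference, recent LightGBM versions can already hand back the per-fold boosters trained inside lgb.cv, which is the approach the linked posts describe; a minimal sketch:

import lightgbm as lgb
import numpy as np
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=512, random_state=0)
train_set = lgb.Dataset(X, label=y)

# return_cvbooster=True (LightGBM >= 3.0) exposes the trained fold models.
result = lgb.cv({"objective": "binary"}, train_set, num_boost_round=100,
                nfold=5, return_cvbooster=True)
cvbooster = result["cvbooster"]

# All fold models have the same number of trees; average them on test data.
pred_test = np.mean(cvbooster.predict(X), axis=0)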
This is not an issue. I found that you use get_temp_directory in the tests to create a temporary directory. Is there any reason you don't use the tmpdir fixture in pytest?
https://docs.pytest.org/en/latest/tmpdir.html#the-tmpdir-fixture
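For comparison, the tmpdir fixture is injected as a test argument and pytest creates and cleans up a unique directory per test; a minimal example:

def test_write_artifact(tmpdir):
    # tmpdir is a py.path.local object managed by pytest.
    path = tmpdir.join("result.txt")
    path.write("ok")
    assert path.read() == "ok"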
Python 3.9.10 (tags/v3.9.10:f2f3f53, Jan 17 2022, 15:14:21) [MSC v.1929 64 bit (AMD64)] on win32
Running pytest produced the following error messages:
> pytest
========================================================================= short test summary info =========================================================================
FAILED tests/feature/test_groupby.py::test_return_type_by_aggregation - ValueError: Supported types are: <class 'str'> or typing.Callable. Got <class 'numpy._ArrayFunctionDispatcher'> instead.
FAILED tests/feature/nlp/test_bert.py::test_bert_jp - requests.exceptions.ConnectionError: HTTPSConnectionPool(host='cdn-lfs.huggingface.co', port=443): Read timed out.
======================================================== 2 failed, 106 passed, 1084 warnings in 332.27s (0:05:32) =========================================================
_____________________________________________________________________ test_return_type_by_aggregation _____________________________________________________________________
iris_dataframe = ( sl sw pl pw species
0 5.1 3.5 1.4 0.2 0.0
1 4.9 3.0 1.4 0.2 0.0
2 4.7 3.2 1.3... 3.4 5.4 2.3 2.0
149 5.9 3.0 5.1 1.8 2.0
[150 rows x 5 columns], 'species', ['sl', 'sw', 'pl', 'pw'])
def test_return_type_by_aggregation(iris_dataframe):
df, group_key, group_values = iris_dataframe
agg_methods = ["max", np.sum, custom_function]
new_df, new_cols = aggregation(df, group_key, group_values,
agg_methods)
tests\feature\test_groupby.py:27:
input_df = sl sw pl pw species
0 5.1 3.5 1.4 0.2 0.0
1 4.9 3.0 1.4 0.2 0.0
2 4.7 3.2 1.3 ... 6.5 3.0 5.2 2.0 2.0
148 6.2 3.4 5.4 2.3 2.0
149 5.9 3.0 5.1 1.8 2.0
[150 rows x 5 columns]
group_key = 'species', group_values = ['sl', 'sw', 'pl', 'pw']
agg_methods = ['max', <function sum at 0x000002226B6C2CF0>, <function custom_function at 0x000002221318C280>]
def aggregation(
input_df: pd.DataFrame,
group_key: str,
group_values: List[str],
agg_methods: List[Union[str, FunctionType]],
) -> Tuple[pd.DataFrame, List[str]]:
"""
Aggregate values after grouping table rows by a given key.
Args:
input_df:
Input data frame.
group_key:
Used to determine the groups for the groupby.
group_values:
Used to aggregate values for the groupby.
agg_methods:
List of function or function names, e.g. ['mean', 'max', 'min', numpy.mean].
Do not use a lambda function because the name attribute of the lambda function cannot generate a unique string of column names in <lambda>.
Returns:
Tuple of output dataframe and new column names.
"""
new_df = input_df.copy()
new_cols = []
for agg_method in agg_methods:
if _is_lambda_function(agg_method):
raise ValueError('Not supported lambda function.')
elif isinstance(agg_method, str):
pass
elif isinstance(agg_method, FunctionType):
pass
else:
raise ValueError('Supported types are: {} or {}.'
' Got {} instead.'.format(str, Callable, type(agg_method)))
E ValueError: Supported types are: <class 'str'> or typing.Callable. Got <class 'numpy._ArrayFunctionDispatcher'> instead.
nyaggle\feature\groupby.py:89: ValueError
In the test code, numpy.sum is passed as one of the values of aggregation's agg_methods argument.
nyaggle/tests/feature/test_groupby.py
Lines 24 to 30 in 44b0169
Only the following three types are accepted for aggregation's agg_methods argument, and since numpy.sum's class is <class 'numpy._ArrayFunctionDispatcher'>, it is rejected by the if statement.
nyaggle/nyaggle/feature/groupby.py
Lines 81 to 90 in 44b0169
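One possible fix (my own sketch, not a committed patch) is to accept any non-lambda callable instead of requiring FunctionType, since NumPy >= 1.25 makes np.sum a numpy._ArrayFunctionDispatcher rather than a plain function:

from typing import Callable

# Sketch of a replacement for the type check in aggregation();
# _is_lambda_function and agg_methods come from the quoted source above.
for agg_method in agg_methods:
    if _is_lambda_function(agg_method):
        raise ValueError('Not supported lambda function.')
    elif isinstance(agg_method, str) or callable(agg_method):
        pass  # np.sum and other non-FunctionType callables are now accepted
    else:
        raise ValueError('Supported types are: {} or {}.'
                         ' Got {} instead.'.format(str, Callable, type(agg_method)))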
AttributeError occurs when a TargetEncoder object is passed to the str() function.
This behavior has the potential to become an issue when debugging, etc.
The steps to reproduce are the following.
>>> from nyaggle.feature.category_encoder.target_encoder import TargetEncoder
>>> encoder = TargetEncoder()
>>> str(encoder)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/amedama/.virtualenvs/py310/lib/python3.10/site-packages/sklearn/base.py", line 279, in __repr__
repr_ = pp.pformat(self)
File "/usr/local/Cellar/[email protected]/3.10.9/Frameworks/Python.framework/Versions/3.10/lib/python3.10/pprint.py", line 157, in pformat
self._format(object, sio, 0, 0, {}, 0)
File "/usr/local/Cellar/[email protected]/3.10.9/Frameworks/Python.framework/Versions/3.10/lib/python3.10/pprint.py", line 174, in _format
rep = self._repr(object, context, level)
File "/usr/local/Cellar/[email protected]/3.10.9/Frameworks/Python.framework/Versions/3.10/lib/python3.10/pprint.py", line 454, in _repr
repr, readable, recursive = self.format(object, context.copy(),
File "/Users/amedama/.virtualenvs/py310/lib/python3.10/site-packages/sklearn/utils/_pprint.py", line 189, in format
return _safe_repr(
File "/Users/amedama/.virtualenvs/py310/lib/python3.10/site-packages/sklearn/utils/_pprint.py", line 440, in _safe_repr
params = _changed_params(object)
File "/Users/amedama/.virtualenvs/py310/lib/python3.10/site-packages/sklearn/utils/_pprint.py", line 93, in _changed_params
params = estimator.get_params(deep=False)
File "/Users/amedama/.virtualenvs/py310/lib/python3.10/site-packages/sklearn/base.py", line 211, in get_params
value = getattr(self, key)
AttributeError: 'TargetEncoder' object has no attribute 'cols'
TargetEncoder takes a 'cols' parameter but doesn't save it to an attribute (it is transferred to category_encoders.TargetEncoder internally). But scikit-learn's BaseEstimator#get_params() expects that all __init__() parameters are saved, so accessing 'cols' raises AttributeError.
$ python -V
Python 3.10.9
$ pip list | egrep -i "(nyaggle|scikit-learn)"
nyaggle 0.1.5
scikit-learn 1.2.0
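The scikit-learn convention is that __init__() stores every argument verbatim on self and defers any work to fit(); a minimal sketch of a wrapper that follows it (assuming the wrapper currently forwards cols without saving it; only the cols parameter is shown):

import category_encoders as ce
from sklearn.base import BaseEstimator, TransformerMixin

class TargetEncoder(BaseEstimator, TransformerMixin):
    def __init__(self, cols=None):
        self.cols = cols  # saved as-is so get_params()/repr() can read it

    def fit(self, X, y):
        # Build the wrapped encoder lazily in fit(), not in __init__().
        self.encoder_ = ce.TargetEncoder(cols=self.cols)
        self.encoder_.fit(X, y)
        return self

    def transform(self, X):
        return self.encoder_.transform(X)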
Hello, this is @wakame1367.
Currently, only the repository owner seems to have permission to merge PRs. If @nyanp alone holds merge permission, I think the burden on @nyanp becomes large. As a solution, granting that permission to contributors as well would reduce the burden and improve development efficiency.
I am also concerned that some PRs have been left open for a long time, for example the following PR:
#96
So here is a proposal: how about adding trusted contributors as Collaborators of the repository? Contributors added as Collaborators gain additional permissions, including permission to merge PRs.
The following is how to add a Collaborator, for reference.
To give another user permission to merge Pull Requests (PRs) on GitHub, you need to add that user as a "Collaborator" or "Team Member" of the repository. The steps are as follows:
Add as a Collaborator
The CI in #96 is failing on Python 3.6. Python 3.6 ~ 3.7 should be dropped from support and removed from CI, as they are already EOL.
It would be nice if run_experiment accepted a dictionary to customize the layout of logging artifacts.
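For illustration, the customization could look something like the dictionary below (a purely hypothetical shape; none of these keys or paths exist in nyaggle today):

# Hypothetical: every key and filename here is illustrative only.
artifact_layout = {
    "oof_prediction": "predictions/oof.npy",
    "test_prediction": "predictions/test.npy",
    "feature_importance": "importance/importance.csv",
    "metrics": "metrics.json",
}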
Collecting hyperparameters used in top Kaggle solutions and providing an API to use them seems like a good starting point for a new competition.
I often use preemptible instances, so I want to resume training if an instance is shut down.
Can nyaggle resume training? If not now, is nyaggle planning to implement it?
Support simple stacked generalization from multiple experiments.
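A minimal sketch of what that could mean in practice: use the out-of-fold predictions saved by several experiments as features for a second-level model (the file paths are assumptions, not an existing layout):

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Assumed inputs: OOF predictions and target saved by previous experiments.
oof_a = np.load("logs/exp_a/oof.npy")
oof_b = np.load("logs/exp_b/oof.npy")
y = np.load("logs/y.npy")

# Stack first-level predictions column-wise and evaluate a meta-model.
X_meta = np.column_stack([oof_a, oof_b])
print(cross_val_score(Ridge(), X_meta, y, cv=5).mean())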
A number of flake8 errors were logged during the CI check. While not fatal, it is desirable to fix them.
nyaggle should not automatically install xgboost, lightgbm, and catboost into the user's environment.
see also: shap.
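One common way to do this (the pattern shap itself uses) is to move them into extras_require; a hypothetical setup.py fragment (extra names are my own choices):

from setuptools import setup, find_packages

setup(
    name="nyaggle",
    packages=find_packages(),
    install_requires=["pandas", "scikit-learn"],  # core dependencies only
    extras_require={
        # GBDT backends become opt-in: pip install nyaggle[all]
        "all": ["lightgbm", "xgboost", "catboost"],
    },
)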
I think it would be good to write a CONTRIBUTING.md that specifies, for PRs and issues to this repository, things such as the format in which bug reports should be submitted and how to set up a test environment when submitting a PR for a bug you found. I think the content would include items like the following.