
ray-project / tune-sklearn


A drop-in replacement for Scikit-Learn’s GridSearchCV / RandomizedSearchCV -- but with cutting edge hyperparameter tuning techniques.

Home Page: https://docs.ray.io/en/master/tune/api_docs/sklearn.html

License: Apache License 2.0

Python 97.46% Shell 2.54%
scikit-learn hyperparameter-tuning gridsearchcv bayesian-optimization automl

tune-sklearn's Introduction

tune-sklearn


Tune-sklearn is a drop-in replacement for Scikit-Learn’s model selection module (GridSearchCV, RandomizedSearchCV) with cutting edge hyperparameter tuning techniques.

⚠️ tune-sklearn is no longer being maintained

The latest release 0.5.0 is the last version of the library that will be released by the Ray team, and it is compatible with ray>=2.7.x, ray<=2.9.x. The library will not be guaranteed to work with future Ray versions.

The recommended alternative, which also keeps you up to date with the latest versions of Ray, is to migrate tune-sklearn usage to the Ray Tune APIs, which can accomplish the same thing.
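As a rough illustration only (not an official migration recipe), tuning a scikit-learn estimator directly with Ray Tune might look like the sketch below, which wraps scikit-learn's cross_val_score in a Tune trainable; the dataset, search space, and trial count are placeholders, and the Tuner/train.report APIs assume Ray 2.7 or later.

# Illustrative sketch only: tuning a scikit-learn estimator with Ray Tune directly.
# The dataset, search space, and trial count below are placeholders.
from ray import train, tune
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20)

def trainable(config):
    clf = SGDClassifier(alpha=config["alpha"])
    # Report the mean cross-validation accuracy back to Tune
    train.report({"mean_accuracy": cross_val_score(clf, X, y, cv=3).mean()})

tuner = tune.Tuner(
    trainable,
    param_space={"alpha": tune.loguniform(1e-4, 1e-1)},
    tune_config=tune.TuneConfig(num_samples=10, metric="mean_accuracy", mode="max"),
)
print(tuner.fit().get_best_result().config)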

Feel free to post an issue on the Ray GitHub if you run into any issues while migrating.

Features

Here’s what tune-sklearn has to offer:

  • Consistency with Scikit-Learn API: Change fewer than 5 lines in a standard Scikit-Learn script to use the API [example].
  • Modern tuning techniques: tune-sklearn allows you to easily leverage Bayesian Optimization, HyperBand, BOHB, and other optimization techniques by simply toggling a few parameters.
  • Framework support: tune-sklearn is used primarily for tuning Scikit-Learn models, but it also supports and provides examples for many other frameworks with Scikit-Learn wrappers such as Skorch (Pytorch) [example], KerasClassifier (Keras) [example], and XGBoostClassifier (XGBoost) [example].
  • Scale up: Tune-sklearn leverages Ray Tune, a library for distributed hyperparameter tuning, to parallelize cross validation on multiple cores and even multiple machines without changing your code.

Check out our API Documentation and Walkthrough (for master branch).

Installation

Dependencies

  • numpy (>=1.16)
  • ray (>=2.7.0)
  • scikit-learn (>=0.23)

User Installation

pip install tune-sklearn ray[tune]

or

pip install -U git+https://github.com/ray-project/tune-sklearn.git && pip install 'ray[tune]'

Tune-sklearn Early Stopping

For certain estimators, tune-sklearn can also immediately enable incremental training and early stopping. Such estimators include:

  • Estimators that implement 'warm_start' (except for ensemble classifiers and decision trees)
  • Estimators that implement partial fit
  • XGBoost, LightGBM and CatBoost models (via incremental learning)

To read more about compatible scikit-learn models, see scikit-learn's documentation at section 8.1.1.3.

Early stopping algorithms that can be enabled include HyperBand and Median Stopping (see below for examples).

If the estimator does not support partial_fit, a warning will be shown saying early stopping cannot be done and it will simply run the cross-validation on Ray's parallel back-end.

Apart from early stopping scheduling algorithms, tune-sklearn also supports passing custom stoppers to Ray Tune. These can be passed via the stopper argument when instantiating TuneSearchCV or TuneGridSearchCV. See the Ray documentation for an overview of available stoppers.
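For illustration, a minimal sketch combining a scheduler-based early stopping rule with a Ray Tune stopper is shown below. It assumes the stopper argument behaves as described above, uses Ray Tune's built-in MaximumIterationStopper, and fills in placeholder values.

# Minimal sketch, assuming the `stopper` argument described above;
# the estimator, search space, and stopper values are illustrative only.
from ray.tune.stopper import MaximumIterationStopper
from sklearn.linear_model import SGDClassifier
from tune_sklearn import TuneSearchCV

tune_search = TuneSearchCV(
    SGDClassifier(),
    {"alpha": (1e-4, 1e-1, "log-uniform")},
    early_stopping="MedianStoppingRule",          # scheduler-based early stopping
    stopper=MaximumIterationStopper(max_iter=5),  # custom Ray Tune stopper
    n_trials=3,
    max_iters=10,
)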

Examples

To start out, it’s as easy as changing our import statement to get Tune’s grid search cross validation interface, and the rest is almost identical!

TuneGridSearchCV accepts dictionaries in the format { param_name: str : distribution: list } or a list of such dictionaries, just like scikit-learn's GridSearchCV. The distribution can also be the output of Ray Tune's tune.grid_search.
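For instance (an illustrative sketch only, assuming ray[tune] is installed), the grid used in the example below could equivalently be written with tune.grid_search:

# Illustrative only: the same grid expressed with Ray Tune's grid_search helper
from ray import tune

parameters = {
    "alpha": tune.grid_search([1e-4, 1e-1, 1]),
    "epsilon": tune.grid_search([0.01, 0.1]),
}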

# from sklearn.model_selection import GridSearchCV
from tune_sklearn import TuneGridSearchCV

# Other imports
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier

# Set training and validation sets
X, y = make_classification(n_samples=11000, n_features=1000, n_informative=50, n_redundant=0, n_classes=10, class_sep=2.5)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1000)

# Example parameters to tune from SGDClassifier
parameters = {
    'alpha': [1e-4, 1e-1, 1],
    'epsilon':[0.01, 0.1]
}

tune_search = TuneGridSearchCV(
    SGDClassifier(),
    parameters,
    early_stopping="MedianStoppingRule",
    max_iters=10
)

import time # Just to compare fit times
start = time.time()
tune_search.fit(X_train, y_train)
end = time.time()
print("Tune Fit Time:", end - start)
pred = tune_search.predict(X_test)
accuracy = np.count_nonzero(np.array(pred) == np.array(y_test)) / len(pred)
print("Tune Accuracy:", accuracy)

If you'd like to compare fit times with sklearn's GridSearchCV, run the following block of code:

from sklearn.model_selection import GridSearchCV
# n_jobs=-1 enables use of all cores like Tune does
sklearn_search = GridSearchCV(
    SGDClassifier(),
    parameters,
    n_jobs=-1
)

start = time.time()
sklearn_search.fit(X_train, y_train)
end = time.time()
print("Sklearn Fit Time:", end - start)
pred = sklearn_search.predict(X_test)
accuracy = np.count_nonzero(np.array(pred) == np.array(y_test)) / len(pred)
print("Sklearn Accuracy:", accuracy)

TuneSearchCV is an upgraded version of scikit-learn's RandomizedSearchCV.

It also provides a wrapper for several search optimization algorithms from Ray Tune's searchers, which in turn are wrappers for other libraries. The selection of the search algorithm is controlled by the search_optimization parameter. In order to use other algorithms, you need to install the libraries they depend on (see the pip install column in the table below). The search algorithms are as follows:

Algorithm | search_optimization value | Summary | Website | pip install
-- | -- | -- | -- | --
(Random Search) | "random" | Randomized Search | | built-in
SkoptSearch | "bayesian" | Bayesian Optimization | [Scikit-Optimize] | scikit-optimize
HyperOptSearch | "hyperopt" | Tree-Parzen Estimators | [HyperOpt] | hyperopt
TuneBOHB | "bohb" | Bayesian Opt/HyperBand | [BOHB] | hpbandster ConfigSpace
Optuna | "optuna" | Tree-Parzen Estimators | [Optuna] | optuna

All algorithms other than RandomListSearcher accept parameter distributions in the form of dictionaries in the format { param_name: str : distribution: tuple or list }.

Tuples represent real distributions and should be two-element or three-element, in the format (lower_bound: float, upper_bound: float, Optional: "uniform" (default) or "log-uniform"). Lists represent categorical distributions. Ray Tune Search Spaces are also supported and provide a rich set of potential distributions. Search spaces allow users to specify complex, potentially nested search spaces and parameter distributions. Furthermore, each algorithm also accepts parameters in its own specific format. More information is available in the Tune documentation.

Random Search (default) accepts dictionaries in the format { param_name: str : distribution: list } or a list of such dictionaries, just like scikit-learn's RandomizedSearchCV.
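For instance, a minimal random-search sketch with list-valued distributions might look like this (the estimator, values, and trial count are illustrative):

# Illustrative sketch of the default random search with list-valued distributions
from sklearn.linear_model import SGDClassifier
from tune_sklearn import TuneSearchCV

random_search = TuneSearchCV(
    SGDClassifier(),
    param_distributions={"alpha": [1e-4, 1e-3, 1e-2], "epsilon": [0.01, 0.1]},
    n_trials=3,  # search_optimization defaults to "random"
)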

from tune_sklearn import TuneSearchCV

# Other imports
import scipy
from ray import tune
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier

# Set training and validation sets
X, y = make_classification(n_samples=11000, n_features=1000, n_informative=50, n_redundant=0, n_classes=10, class_sep=2.5)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1000)

# Example parameter distributions to tune from SGDClassifier
# Note the use of tuples when non-random optimization is desired
param_dists = {
    'loss': ['squared_hinge', 'hinge'], 
    'alpha': (1e-4, 1e-1, 'log-uniform'),
    'epsilon': (1e-2, 1e-1)
}

bohb_tune_search = TuneSearchCV(SGDClassifier(),
    param_distributions=param_dists,
    n_trials=2,
    max_iters=10,
    search_optimization="bohb"
)

bohb_tune_search.fit(X_train, y_train)

# Define `param_dists` using the Ray Tune Search Space API
# This allows sampling from continuous and categorical
# distributions (e.g. tune.choice for the `loss` parameter below)
param_dists = {
    'loss': tune.choice(['squared_hinge', 'hinge']),
    'alpha': tune.loguniform(1e-4, 1e-1),
    'epsilon': tune.uniform(1e-2, 1e-1),
}


hyperopt_tune_search = TuneSearchCV(SGDClassifier(),
    param_distributions=param_dists,
    n_trials=2,
    early_stopping=True, # uses Async HyperBand if set to True
    max_iters=10,
    search_optimization="hyperopt"
)

hyperopt_tune_search.fit(X_train, y_train)

Other Machine Learning Libraries and Examples

Tune-sklearn also supports the use of other machine learning libraries such as PyTorch (using Skorch) and Keras. You can find these examples in the examples directory of the repository.

See the auto-generated docs here.

These are generated by lazydocs and should be updated on every release:

pip install lazydocs
lazydocs /path/to/tune-sklearn/tune-sklearn --src-base-url="https://github.com/ray-project/tune-sklearn/blob/master" --overview-file="README.md"

More information

Ray Tune

tune-sklearn's People

Contributors

amogkam, anthonyhsyu, holgern, inventormc, jimthompson5802, justinvyu, krfricke, mattkretschmer, mkretsch327, richardliaw, rspeare, sumanthratna, timvink, yard1


tune-sklearn's Issues

Properly raise the right error/exception when training fails

When an error happens in _trainable, we don't explicitly raise errors; we rely on Ray Tune to log the error so it shows up in logging, and catch the TuneError that's raised. Looks like this is making errors hard to understand, especially when logging is off.

#107

For example:

import numpy as np
from tune_sklearn import TuneGridSearchCV
from sklearn.linear_model import Ridge, SGDClassifier
parameter_grid = {"alpha": [1e-4, 1e-1, 1], "epsilon": [0.01, 0.1]}


X = np.array([[-1, -1], [-2, -1], [1, 1], [2, 1], [2, 1]])
y = np.array([1, 1, 2, 2, 2])


tune_search = TuneGridSearchCV(
    SGDClassifier(), parameter_grid, scoring="f1_micro", max_iters=20)
tune_search.fit(X, y)
/Users/rliaw/dev/tune-sklearn/tune_sklearn/tune_basesearch.py:391: UserWarning: max_iters is set > 1 but incremental/partial training is not enabled. To enable partial training, ensure the estimator has `partial_fit` or `warm_start` and set `early_stopping=True`. Automatically setting max_iters=1.
  category=UserWarning)
File descriptor limit 256 is too low for production servers and may result in connection errors. At least 8192 is recommended. --- Fix with 'ulimit -n 8192'
Trial _Trainable_929b0_00005: Error processing event.
Traceback (most recent call last):
  File "/Users/rliaw/miniconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 726, in _process_trial
    result = self.trial_executor.fetch_result(trial)
  File "/Users/rliaw/miniconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 489, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/Users/rliaw/miniconda3/lib/python3.7/site-packages/ray/worker.py", line 1473, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ValueError): ray::_Trainable.train() (pid=89931, ip=192.168.1.115)
  File "/Users/rliaw/miniconda3/lib/python3.7/queue.py", line 167, in get
    raise Empty
_queue.Empty

During handling of the above exception, another exception occurred:

ray::_Trainable.train() (pid=89931, ip=192.168.1.115)
  File "python/ray/_raylet.pyx", line 472, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 476, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 477, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 431, in ray._raylet.execute_task.function_executor
  File "/Users/rliaw/miniconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 336, in train
    result = self.step()
  File "/Users/rliaw/dev/tune-sklearn/tune_sklearn/_trainable.py", line 106, in step
    return self._train()
  File "/Users/rliaw/dev/tune-sklearn/tune_sklearn/_trainable.py", line 247, in _train
    return_train_score=self.return_train_score,
  File "/Users/rliaw/miniconda3/lib/python3.7/site-packages/sklearn/model_selection/_validation.py", line 236, in cross_validate
    for train, test in cv.split(X, y, groups))
  File "/Users/rliaw/miniconda3/lib/python3.7/site-packages/joblib/parallel.py", line 1029, in __call__
    if self.dispatch_one_batch(iterator):
  File "/Users/rliaw/miniconda3/lib/python3.7/site-packages/joblib/parallel.py", line 819, in dispatch_one_batch
    islice = list(itertools.islice(iterator, big_batch_size))
  File "/Users/rliaw/miniconda3/lib/python3.7/site-packages/sklearn/model_selection/_validation.py", line 231, in <genexpr>
    delayed(_fit_and_score)(
  File "/Users/rliaw/miniconda3/lib/python3.7/site-packages/sklearn/model_selection/_split.py", line 335, in split
    for train, test in super().split(X, y, groups):
  File "/Users/rliaw/miniconda3/lib/python3.7/site-packages/sklearn/model_selection/_split.py", line 80, in split
    for test_index in self._iter_test_masks(X, y, groups):
  File "/Users/rliaw/miniconda3/lib/python3.7/site-packages/sklearn/model_selection/_split.py", line 692, in _iter_test_masks
    test_folds = self._make_test_folds(X, y)
  File "/Users/rliaw/miniconda3/lib/python3.7/site-packages/sklearn/model_selection/_split.py", line 663, in _make_test_folds
    % (self.n_splits))
ValueError: n_splits=5 cannot be greater than the number of members in each class.

Note that this does not actually raise the error but instead logs it.
Here is the sklearn version for reference:

(base) ➜  tune-sklearn git:(master) ✗ python _test.py
Traceback (most recent call last):
  File "_test.py", line 14, in <module>
    tune_search.fit(X, y)
  File "/Users/rliaw/miniconda3/lib/python3.7/site-packages/sklearn/model_selection/_search.py", line 710, in fit
    self._run_search(evaluate_candidates)
  File "/Users/rliaw/miniconda3/lib/python3.7/site-packages/sklearn/model_selection/_search.py", line 1151, in _run_search
    evaluate_candidates(ParameterGrid(self.param_grid))
  File "/Users/rliaw/miniconda3/lib/python3.7/site-packages/sklearn/model_selection/_search.py", line 689, in evaluate_candidates
    cv.split(X, y, groups)))
  File "/Users/rliaw/miniconda3/lib/python3.7/site-packages/sklearn/model_selection/_split.py", line 335, in split
    for train, test in super().split(X, y, groups):
  File "/Users/rliaw/miniconda3/lib/python3.7/site-packages/sklearn/model_selection/_split.py", line 80, in split
    for test_index in self._iter_test_masks(X, y, groups):
  File "/Users/rliaw/miniconda3/lib/python3.7/site-packages/sklearn/model_selection/_split.py", line 692, in _iter_test_masks
    test_folds = self._make_test_folds(X, y)
  File "/Users/rliaw/miniconda3/lib/python3.7/site-packages/sklearn/model_selection/_split.py", line 663, in _make_test_folds
    % (self.n_splits))
ValueError: n_splits=5 cannot be greater than the number of members in each class.

[Bug] ConfigurationSpace throws error with BOHB

Passing a ConfigurationSpace after setting search_optimization='bohb' in TuneSearchCV throws the following error:

/usr/local/lib/python3.6/dist-packages/tune_sklearn/tune_basesearch.py in fit(self, X, y, groups, **fit_params)
    383                               "To show process output, set verbose=2.")
    384 
--> 385             result = self._fit(X, y, groups, **fit_params)
    386 
    387             if not ray_init and ray.is_initialized():

/usr/local/lib/python3.6/dist-packages/tune_sklearn/tune_basesearch.py in _fit(self, X, y, groups, **fit_params)
    331         config["n_jobs"] = self.sk_n_jobs
    332 
--> 333         self._fill_config_hyperparam(config)
    334         analysis = self._tune_run(config, resources_per_trial)
    335 

/usr/local/lib/python3.6/dist-packages/tune_sklearn/tune_search.py in _fill_config_hyperparam(self, config)
    336         samples = 1
    337         all_lists = True
--> 338         for key, distribution in self.param_distributions.items():
    339             if isinstance(distribution, list):
    340                 import random

AttributeError: 'ConfigurationSpace' object has no attribute 'items'

List of Dictionaries for Grid/Randomized Search

Lists of dictionaries are not supported for TuneGridSearchCV/TuneRandomizedSearchCV. That is what caused the error in the pipeline example added in #14, not a lack of support for Pipeline parameters; Pipelines themselves are supported, as shown in #14.

Early stopping for xgboost and lgbm

XGBoost early stopping is actually slightly slower than not early stopping at the moment (possibly because loading the model back in is inefficient?).

LightGBM early stopping can't be tested right now because init_model is not in the stable version, and installing from source isn't working on Travis for some reason.

cv_results lineup issue

In running more experiments, we discovered that the results from tune_sklearn and sklearn do not line up. The params printed out are in the correct order, but the scores do not correspond to them.

TuneGridSearchCV.fit(X, Y) would not run

Hi, I am using TuneGridSearchCV.fit as in the example below and am getting the following error.

X, y = make_classification(n_samples=11000, n_features=1000, n_informative=50,
n_redundant=0, n_classes=10, class_sep=2.5)

TuneGridSearchCV.fit(X, y)

TuneGridSearchCV.fit(X, y)
Traceback (most recent call last):

File "", line 1, in
TuneGridSearchCV.fit(X, y)

File "C:\Users\myusername\AppData\Local\Programs\anaconda3\lib\site-packages\tune_sklearn\tune_basesearch.py", line 385, in fit
result = self._fit(X, y, groups, **fit_params)

AttributeError: 'numpy.ndarray' object has no attribute '_fit'

Metric names issue in tune_sklearn.tune_basesearch.TuneBaseSearchCV._format_results

In tune_sklearn/tune_basesearch.py:603, the way the column names are derived from the given metric names isn't robust for customized scoring functions and metric names; see below:
df[[ col for col in dfs[0].columns if "split" in col and "test_%s" % name in col ]].to_numpy() for df in finished
If the metric names are 'metric 1' and 'metric 11', the results for 'metric 11' will also be added under 'metric 1'. In this case, using col.endswith("test_%s" % name) would be better.
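A tiny illustration of the collision, using hypothetical column names:

# Hypothetical column names illustrating the substring collision
cols = ["split0_test_metric 1", "split0_test_metric 11"]
name = "metric 1"
print([c for c in cols if "split" in c and "test_%s" % name in c])         # both columns match
print([c for c in cols if "split" in c and c.endswith("test_%s" % name)])  # only the intended column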

TuneError. Any idea?

I'm getting an error as soon as I try calling the fit method:

TuneError                                 Traceback (most recent call last)
<ipython-input-141-9038f3bc751b> in <module>
     37     kernel_initializer=kernel_initializer)
     38 grid = TuneGridSearchCV(estimator=model, param_grid=param_grid)
---> 39 grid_result = grid.fit(X, y)
     40 print(grid_result.best_params_)
     41 print(grid_result.cv_results_)

~\Anaconda3\envs\PythonGPU\lib\site-packages\tune_sklearn\tune_basesearch.py in fit(self, X, y, groups, **fit_params)
    366                 ray.init(ignore_reinit_error=True, configure_logging=False)
    367 
--> 368             result = self._fit(X, y, groups, **fit_params)
    369 
    370             if not ray_init and ray.is_initialized():

~\Anaconda3\envs\PythonGPU\lib\site-packages\tune_sklearn\tune_basesearch.py in _fit(self, X, y, groups, **fit_params)
    320 
    321         self._fill_config_hyperparam(config)
--> 322         analysis = self._tune_run(config, resources_per_trial)
    323 
    324         self.cv_results_ = self._format_results(self.n_splits, analysis)

~\Anaconda3\envs\PythonGPU\lib\site-packages\tune_sklearn\tune_gridsearch.py in _tune_run(self, config, resources_per_trial)
    203                 config=config,
    204                 checkpoint_at_end=True,
--> 205                 resources_per_trial=resources_per_trial)
    206 
    207         return analysis

~\Anaconda3\envs\PythonGPU\lib\site-packages\ray\tune\tune.py in run(run_or_experiment, name, stop, config, resources_per_trial, num_samples, local_dir, upload_dir, trial_name_creator, loggers, sync_to_cloud, sync_to_driver, checkpoint_freq, checkpoint_at_end, sync_on_checkpoint, keep_checkpoints_num, checkpoint_score_attr, global_checkpoint_period, export_formats, max_failures, fail_fast, restore, search_alg, scheduler, with_server, server_port, verbose, progress_reporter, resume, queue_trials, reuse_actors, trial_executor, raise_on_failed_trial, return_trials, ray_auto_init)
    347     if incomplete_trials:
    348         if raise_on_failed_trial:
--> 349             raise TuneError("Trials did not complete", incomplete_trials)
    350         else:
    351             logger.error("Trials did not complete: %s", incomplete_trials)

TuneError: ('Trials did not complete', [_Trainable_edcaa_00000, _Trainable_edcaa_00001, _Trainable_edcaa_00002, _Trainable_edcaa_00003, _Trainable_edcaa_00004, _Trainable_edcaa_00005, _Trainable_edcaa_00006, _Trainable_edcaa_00007])

GPU support

Is it easy to enable GPU support in tune-sklearn for examples such as sklearn or xgboost?

TuneSearchCV with EarlyStopping ValueError: sample_weight.shape == (190,), expected (271,)!

from sklearn.linear_model import SGDRegressor
from sklearn.ensemble import RandomForestRegressor

### parameters
# parameter_grid = {"C": [0.2, 1.0, 5.0], "epsilon": [0.02, 0.1, 0.5]}
regr = SGDRegressor(verbose=True)
parameter_grid = {"loss": ["squared_loss", "huber"],"penalty": ['l1', 'l2'], "learning_rate": ['optimal', 'invscaling', 'adaptive']}
### tune-sklearn
from ray.tune.sklearn import TuneGridSearchCV, TuneSearchCV

tune_search = TuneSearchCV(
    regr,
    parameter_grid,
    search_optimization="bayesian",
    n_trials=3,
    early_stopping=True,
# If I set early_stopping=False, it has no problem
    max_iters=10,
)
tune_search.fit(X_train_scaled, y_train_scaled)

The error says

/usr/local/envs/test/lib/python3.8/site-packages/tune_sklearn/tune_basesearch.py:269: UserWarning: max_iters is set > 1 but incremental/partial training is not enabled. To enable partial training, ensure the estimator has `partial_fit` or `warm_start` and set `early_stopping=True`. Automatically setting max_iters=1.
  warnings.warn(
WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 66592768 bytes available. This may slow down performance! You may be able to free up space by deleting files in /dev/shm or terminating any running plasma_store_server processes. If you are inside a Docker container, you may need to pass an argument with the flag '--shm-size' to 'docker run'.
Trial _Trainable_9f558c4c: Error processing event.
Traceback (most recent call last):
  File "/usr/local/envs/test/lib/python3.8/site-packages/ray/tune/trial_runner.py", line 515, in _process_trial
    result = self.trial_executor.fetch_result(trial)
  File "/usr/local/envs/test/lib/python3.8/site-packages/ray/tune/ray_trial_executor.py", line 488, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/usr/local/envs/test/lib/python3.8/site-packages/ray/worker.py", line 1428, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ValueError): ray::_Trainable.train() (pid=7541, ip=172.17.0.3)
  File "python/ray/_raylet.pyx", line 484, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 438, in ray._raylet.execute_task.function_executor
  File "/usr/local/envs/test/lib/python3.8/site-packages/ray/tune/trainable.py", line 336, in train
    result = self.step()
  File "/usr/local/envs/test/lib/python3.8/site-packages/tune_sklearn/_trainable.py", line 119, in step
    return self._train()
  File "/usr/local/envs/test/lib/python3.8/site-packages/tune_sklearn/_trainable.py", line 178, in _train
    self._early_stopping_partial_fit(i, estimator, X_train,
  File "/usr/local/envs/test/lib/python3.8/site-packages/tune_sklearn/_trainable.py", line 125, in _early_stopping_partial_fit
    estimator.partial_fit(X_train, y_train, np.unique(self.y))
  File "/usr/local/envs/test/lib/python3.8/site-packages/sklearn/linear_model/_stochastic_gradient.py", line 1181, in partial_fit
    return self._partial_fit(X, y, self.alpha, C=1.0,
  File "/usr/local/envs/test/lib/python3.8/site-packages/sklearn/linear_model/_stochastic_gradient.py", line 1136, in _partial_fit
    sample_weight = _check_sample_weight(sample_weight, X)
  File "/usr/local/envs/test/lib/python3.8/site-packages/sklearn/utils/validation.py", line 1302, in _check_sample_weight
    raise ValueError("sample_weight.shape == {}, expected {}!"
ValueError: sample_weight.shape == (190,), expected (271,)!

The data is the Boston housing dataset.

[Feature Request] Add default model parameters to tuning trials

While testing on multiple datasets I've observed that if n_iter is low, the tuned model has lower performance on both the train and test sets compared to the model using default parameters. Is there a way to ensure that the default parameters of a model are tried as the first trial, so optimization starts from there? If there's no improvement, there's no need to run all the trials.

Problems with the TuneSearchCV example

Hello,

running the example:

from tune_sklearn import TuneSearchCV
import scipy
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier
from src.misc.skopt.space import Real, Categorical, Integer

X, y = make_classification(n_samples=11000, n_features=1000, n_informative=50, n_redundant=0, n_classes=10, class_sep=2.5)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1000)

param_dists = {
'alpha': (1e-4, 1e-1),
'epsilon': (1e-2, 1e-1)
}

bohb_tune_search = TuneSearchCV(SGDClassifier(),
param_distributions=param_dists,
n_iter=2,
max_iters=10,
search_optimization="bohb"
)

bohb_tune_search.fit(X_train, y_train)

hyperopt_tune_search = TuneSearchCV(SGDClassifier(),
param_distributions=param_dists,
n_iter=2,
early_stopping=True, # uses ASHAScheduler if set to True
max_iters=10,
search_optimization="hyperopt"
)

hyperopt_tune_search.fit(X_train, y_train)

bayesian_tune_search = TuneSearchCV(SGDClassifier(),
param_distributions=param_dists,
n_iter=2,
early_stopping=True, # uses ASHAScheduler if set to True
max_iters=10,
search_optimization="bayesian"
)

bayesian_tune_search.fit(X_train, y_train)

I get the following errors:

  1. Using 'bohb': ValueError: Search optimization must be random or bayesian
  2. Using 'hyperopt': Search optimization must be random or bayesian
  3. Using 'bayesian': TypeError: '<' not supported between instances of 'Version' and 'tuple'

I installed all the packages, including the ones from the pip install column of the table.

Do you need anything else to get me going? Currently I am not able to run anything.

Thank you very much

[Question] Tune Sklearn Algorithm

I have a few questions about how tune-sklearn works:

  1. When I set n_iter=10 and max_iters=10 using BOHB, I see that the hyperparameters for the 10 trials are sampled in the first trial itself and they remain the same in all 10 trials. Aren't the hyperparameters of the later trials supposed to change dynamically based on the results from the earlier trials instead?

  2. When setting n_iter=10 and cv=3 does that mean the model is run 30 times with 3 cross-validation models run per trial? In that case isn't it much more efficient to check the test score on only one of the folds and if the score is too low, discard the entire trial without running the other 2 folds and picking another trial instead?

  3. How does early stopping work using max_iters? What is the stopping condition? And how is it different from using one of the schedulers like ASHA from Ray Tune?

  4. I tried setting both n_iter=50 and max_iters=50 but the log only shows 10 trials. Why is that?

  5. On Google Colab, I set n_jobs=-1 for LightGBM and n_jobs=-1 and sk_n_jobs=-1 for tune-sklearn but Resources requested shows 1/2 CPUs used. How are the different types n_jobs used in this scenario and why isn't it using both CPU cores when setting it to -1?

[Bug] Tuning CatBoost with GPU fails on Google Colab

Running the following code on Google Colab with a GPU hardware accelerator:

from catboost import CatBoostClassifier
from tune_sklearn import TuneSearchCV
import ray
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Load breast cancer dataset
cancer = load_breast_cancer()
X = cancer.data
y = cancer.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = CatBoostClassifier(task_type="GPU", logging_level="Silent")
param_dists = {
    "iterations": (500, 600)
}

ray.init(webui_host="0.0.0.0")
gs = TuneSearchCV(model, param_dists, n_iter=2, scoring="accuracy",
                  search_optimization="bayesian")
gs.fit(X_train, y_train)
print(gs.cv_results_)

pred = gs.predict(X_test)
correct = 0
for i in range(len(y_test)):
    if pred[i] == y_test[i]:
        correct += 1
print("Accuracy:", correct / len(pred))

throws the following error:

2020-08-13 06:42:21,410	INFO resource_spec.py:212 -- Starting Ray with 7.13 GiB memory available for workers and up to 3.58 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
2020-08-13 06:42:21,672	WARNING services.py:923 -- Redis failed to start, retrying now.
2020-08-13 06:42:21,873	INFO services.py:1165 -- View the Ray dashboard at 172.28.0.2:8265
/usr/local/lib/python3.6/dist-packages/tune_sklearn/tune_basesearch.py:249: UserWarning: Early stopping is not enabled. To enable early stopping, pass in a supported scheduler from Tune and ensure the estimator has `partial_fit`.
  warnings.warn("Early stopping is not enabled. "
2020-08-13 06:42:30,803	INFO logger.py:271 -- Removed the following hyperparameter values when logging to tensorboard: {'iterations': 510}
2020-08-13 06:42:30,867	INFO logger.py:271 -- Removed the following hyperparameter values when logging to tensorboard: {'iterations': 586}
(pid=2326) /usr/local/lib/python3.6/dist-packages/sklearn/model_selection/_validation.py:536: FitFailedWarning: Estimator fit failed. The score on this train-test partition for these parameters will be set to nan. Details: 
(pid=2326) _catboost.CatBoostError: catboost/cuda/cuda_lib/cuda_base.h:281: CUDA error 100: no CUDA-capable device is detected
(pid=2326) 
(pid=2326)   FitFailedWarning)
(pid=2326) /usr/local/lib/python3.6/dist-packages/sklearn/model_selection/_validation.py:536: FitFailedWarning: Estimator fit failed. The score on this train-test partition for these parameters will be set to nan. Details: 
(pid=2326) _catboost.CatBoostError: catboost/cuda/cuda_lib/cuda_manager.cpp:201: Condition violated: `State == nullptr'
(pid=2326) 
(pid=2326)   FitFailedWarning)
(pid=2326) /usr/local/lib/python3.6/dist-packages/sklearn/model_selection/_validation.py:536: FitFailedWarning: Estimator fit failed. The score on this train-test partition for these parameters will be set to nan. Details: 
(pid=2326) _catboost.CatBoostError: catboost/cuda/cuda_lib/cuda_base.h:281: CUDA error 100: no CUDA-capable device is detected
(pid=2326) 
(pid=2326)   FitFailedWarning)
(pid=2326) /usr/local/lib/python3.6/dist-packages/sklearn/model_selection/_validation.py:536: FitFailedWarning: Estimator fit failed. The score on this train-test partition for these parameters will be set to nan. Details: 
(pid=2326) _catboost.CatBoostError: catboost/cuda/cuda_lib/cuda_manager.cpp:201: Condition violated: `State == nullptr'
(pid=2326) 
(pid=2326)   FitFailedWarning)
(pid=2327) /usr/local/lib/python3.6/dist-packages/sklearn/model_selection/_validation.py:536: FitFailedWarning: Estimator fit failed. The score on this train-test partition for these parameters will be set to nan. Details: 
(pid=2327) _catboost.CatBoostError: catboost/cuda/cuda_lib/cuda_base.h:281: CUDA error 100: no CUDA-capable device is detected
(pid=2327) 
(pid=2327)   FitFailedWarning)
(pid=2327) /usr/local/lib/python3.6/dist-packages/sklearn/model_selection/_validation.py:536: FitFailedWarning: Estimator fit failed. The score on this train-test partition for these parameters will be set to nan. Details: 
(pid=2327) _catboost.CatBoostError: catboost/cuda/cuda_lib/cuda_manager.cpp:201: Condition violated: `State == nullptr'
(pid=2327) 
(pid=2327)   FitFailedWarning)
(pid=2327) /usr/local/lib/python3.6/dist-packages/sklearn/model_selection/_validation.py:536: FitFailedWarning: Estimator fit failed. The score on this train-test partition for these parameters will be set to nan. Details: 
(pid=2327) _catboost.CatBoostError: catboost/cuda/cuda_lib/cuda_base.h:281: CUDA error 100: no CUDA-capable device is detected
(pid=2327) 
(pid=2327)   FitFailedWarning)
(pid=2327) /usr/local/lib/python3.6/dist-packages/sklearn/model_selection/_validation.py:536: FitFailedWarning: Estimator fit failed. The score on this train-test partition for these parameters will be set to nan. Details: 
(pid=2327) _catboost.CatBoostError: catboost/cuda/cuda_lib/cuda_manager.cpp:201: Condition violated: `State == nullptr'
(pid=2327) 
(pid=2327)   FitFailedWarning)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-11-beac382c1f80> in <module>()
     21 gs = TuneSearchCV(model, param_dists, n_iter=2, scoring="accuracy",
     22                   search_optimization="bayesian")
---> 23 gs.fit(X_train, y_train)
     24 print(gs.cv_results_)
     25 

5 frames
/usr/local/lib/python3.6/dist-packages/pandas/core/indexing.py in _getitem_axis(self, key, axis)
   1491             key = item_from_zerodim(key)
   1492             if not is_integer(key):
-> 1493                 raise TypeError("Cannot index by location index with a non-integer key")
   1494 
   1495             # validate the location

TypeError: Cannot index by location index with a non-integer key

while running the model directly (outside of tune-sklearn) works fine:

model = CatBoostClassifier(task_type="GPU", logging_level="Silent")
model.fit(X_train, y_train)

[Bug] Multiple warnings and errors using the Readme example

Running the following example from the Readme on Google Colab:

from tune_sklearn import TuneSearchCV

# Other imports
import scipy
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier

# Set training and validation sets
X, y = make_classification(n_samples=11000, n_features=1000, n_informative=50, n_redundant=0, n_classes=10, class_sep=2.5)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1000)

# Example parameter distributions to tune from SGDClassifier
# Note the use of tuples instead if non-random optimization is desired
param_dists = {
    'alpha': (1e-4, 1e-1),
    'epsilon': (1e-2, 1e-1)
}

bohb_tune_search = TuneSearchCV(SGDClassifier(),
    param_distributions=param_dists,
    n_iter=2,
    max_iters=10,
    search_optimization="bohb"
)

bohb_tune_search.fit(X_train, y_train)

gives the following warnings and errors:

/usr/local/lib/python3.6/dist-packages/tune_sklearn/tune_basesearch.py:249: UserWarning: Early stopping is not enabled. To enable early stopping, pass in a supported scheduler from Tune and ensure the estimator has `partial_fit`.
  warnings.warn("Early stopping is not enabled. "
/usr/local/lib/python3.6/dist-packages/tune_sklearn/tune_basesearch.py:382: UserWarning: Hiding process output by default. To show process output, set verbose=2.
  warnings.warn("Hiding process output by default. "
Trial Runner checkpointing failed.
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/ray/tune/trial_runner.py", line 373, in step
    self.checkpoint()
  File "/usr/local/lib/python3.6/dist-packages/ray/tune/trial_runner.py", line 302, in checkpoint
    self._local_checkpoint_dir, session_str=self._session_str)
  File "/usr/local/lib/python3.6/dist-packages/ray/tune/suggest/search_generator.py", line 194, in save_to_dir
    self.CKPT_FILE_TMPL.format(session_str))
  File "/usr/local/lib/python3.6/dist-packages/ray/tune/suggest/search_generator.py", line 32, in _atomic_save
    pickle.dump(state, f)
AttributeError: Can't pickle local object 'TuneSearchCV._fill_config_hyperparam.<locals>.get_sample.<locals>.<lambda>'
Trial Runner checkpointing failed.
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/ray/tune/trial_runner.py", line 373, in step
    self.checkpoint()
  File "/usr/local/lib/python3.6/dist-packages/ray/tune/trial_runner.py", line 302, in checkpoint
    self._local_checkpoint_dir, session_str=self._session_str)
  File "/usr/local/lib/python3.6/dist-packages/ray/tune/suggest/search_generator.py", line 194, in save_to_dir
    self.CKPT_FILE_TMPL.format(session_str))
  File "/usr/local/lib/python3.6/dist-packages/ray/tune/suggest/search_generator.py", line 32, in _atomic_save
    pickle.dump(state, f)
AttributeError: Can't pickle local object 'TuneSearchCV._fill_config_hyperparam.<locals>.get_sample.<locals>.<lambda>'
Trial Runner checkpointing failed.
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/ray/tune/tune.py", line 360, in run
    runner.checkpoint(force=True)
  File "/usr/local/lib/python3.6/dist-packages/ray/tune/trial_runner.py", line 302, in checkpoint
    self._local_checkpoint_dir, session_str=self._session_str)
  File "/usr/local/lib/python3.6/dist-packages/ray/tune/suggest/search_generator.py", line 194, in save_to_dir
    self.CKPT_FILE_TMPL.format(session_str))
  File "/usr/local/lib/python3.6/dist-packages/ray/tune/suggest/search_generator.py", line 32, in _atomic_save
    pickle.dump(state, f)
AttributeError: Can't pickle local object 'TuneSearchCV._fill_config_hyperparam.<locals>.get_sample.<locals>.<lambda>'
/usr/local/lib/python3.6/dist-packages/sklearn/base.py:197: FutureWarning: From version 0.24, get_params will raise an AttributeError if a parameter cannot be retrieved as an instance attribute. Previously it would return None.
  FutureWarning)
TuneSearchCV(cv=None, early_stopping=None, error_score=nan,
             estimator=SGDClassifier(alpha=0.0001, average=False,
                                     class_weight=None, early_stopping=False,
                                     epsilon=0.1, eta0=0.0, fit_intercept=True,
                                     l1_ratio=0.15, learning_rate='optimal',
                                     loss='hinge', max_iter=1000,
                                     n_iter_no_change=5, n_jobs=None,
                                     penalty='l2', power_t=0.5,
                                     random_state=None, shuffle=True, tol=0.001,
                                     validation_fraction=0.1, verbose=0,
                                     warm_start=False),
             local_dir='~/ray_results', max_iters=1, n_iter=None, n_jobs=None,
             param_distributions={'alpha': (0.0001, 0.1),
                                  'epsilon': (0.01, 0.1)},
             random_state=None, refit=True, return_train_score=False,
             scoring=<function _passthrough_scorer at 0x7fec7a066598>,
             search_optimization='bohb', sk_n_jobs=-1, use_gpu=False,
             verbose=0)

[feedback welcome] Support tensorboard, wandb loggers via strings

Tune currently has a variety of loggers supported: https://docs.ray.io/en/latest/tune/api_docs/logging.html#tbxlogger

We should allow this to also be accessible and configurable via tune-sklearn.

A proposal:

# Default to None
TuneSearchCV(loggers=None)

# Easy access
TuneSearchCV(loggers=["tensorboard"])

# support multiple loggers
TuneSearchCV(loggers=["tensorboard", "mlflow", "csv", "json"], logdir="./"))

# customize 
TuneSearchCV(loggers=["wandb"], logger_config={"..."})

Saved model with joblib NotFittedError unpredictably

Thanks for the library, been getting a lot of mileage out of this.

I'm fitting a TuneSearchCV instance

TuneSearchCV(
                model,
                config,
                n_trials=n_trials,
                cv=3,
                refit=True,
                search_optimization="hyperopt",
                loggers=["mlflow"],
                n_jobs=n_jobs,
                use_gpu=True
            )

and saving the instance with joblib dump; however, some of the saved models return a NotFittedError after being loaded again:

sklearn.exceptions.NotFittedError: This TuneSearchCV instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.

I tested the Estimators and they were fitted before saving. I'm kind of at the end of my debugging skills and wonder if this is a problem with tune-sklearn, possibly even related to #107?


[Bug] Setting 'auto_class_weights' parameter fails with CatBoostClassifier

Tuning CatBoostClassifier while setting auto_class_weights to one of [None, 'Balanced', 'SqrtBalanced'] throws the following error:

ray::_Trainable.train() (pid=2361, ip=172.28.0.2)
  File "python/ray/_raylet.pyx", line 442, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 445, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 446, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 400, in ray._raylet.execute_task.function_executor
  File "/usr/local/lib/python3.6/dist-packages/ray/tune/trainable.py", line 261, in train
    result = self._train()
  File "/usr/local/lib/python3.6/dist-packages/tune_sklearn/_trainable.py", line 128, in _train
    return_train_score=self.return_train_score,
  File "/usr/local/lib/python3.6/dist-packages/sklearn/model_selection/_validation.py", line 236, in cross_validate
    for train, test in cv.split(X, y, groups))
  File "/usr/local/lib/python3.6/dist-packages/joblib/parallel.py", line 1029, in __call__
    if self.dispatch_one_batch(iterator):
  File "/usr/local/lib/python3.6/dist-packages/joblib/parallel.py", line 819, in dispatch_one_batch
    islice = list(itertools.islice(iterator, big_batch_size))
  File "/usr/local/lib/python3.6/dist-packages/sklearn/model_selection/_validation.py", line 236, in <genexpr>
    for train, test in cv.split(X, y, groups))
  File "/usr/local/lib/python3.6/dist-packages/sklearn/base.py", line 78, in clone
    param2 = params_set[name]
KeyError: 'auto_class_weights'

[Bug] GPU not utilized

Tried running the following code on a Colab with the GPU enabled but I get a message saying Warning: you are connected to a GPU runtime, but not utilizing the GPU.

# Install and import libraries
!pip install -U lightgbm
!pip install -U 'ray[tune]'
!pip install -U scikit-optimize
!pip install -U tune_sklearn

from lightgbm import LGBMClassifier
import pandas as pd
import ray
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from tune_sklearn import TuneSearchCV

# Load data
cancer = load_breast_cancer()
X = cancer.data
y = cancer.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Tune
model = LGBMClassifier()
param_dists = {
    'boosting_type': ['gbdt'],
    'colsample_bytree': (0.8, 0.9, 'log-uniform'),
    'reg_alpha': (1.1, 1.3),
    'reg_lambda': (1.1, 1.3),
    'min_split_gain': (0.3, 0.4),
    'subsample': (0.7, 0.9),
    'subsample_freq': (20, 21)
}

tuner = TuneSearchCV(
    model,
    param_dists,
    n_iter=20,
    scoring='f1_weighted',
    n_jobs=-1,
    verbose=2,
    max_iters=10,
    search_optimization='bayesian',
    use_gpu=True,
)

tuner.fit(X_train, y_train)
print('Best Parameters :', tuner.best_params_)

RandomizedSearch Parameter Lineup

When testing XGBClassifier examples, randomized search seems to have trouble matching each sampled parameter value to the corresponding parameter. For example:

import warnings
warnings.filterwarnings('ignore')
import numpy as np
from tune_sklearn.tune_search import TuneRandomizedSearchCV
from sklearn import datasets
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

digits = datasets.load_digits()
x = digits.data
y = digits.target
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.2)

# A parameter grid for XGBoost
params = {
    'min_child_weight': [1, 5, 10],
    'gamma': [0.5, 1, 1.5, 2, 5],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0],
    'max_depth': [3, 4, 5],
}

xgb = XGBClassifier(learning_rate=0.02, n_estimators=50, objective='binary:logistic',
                    silent=True, nthread=1)

from tune_sklearn.tune_search import TuneRandomizedSearchCV, TuneGridSearchCV
digit_search = TuneRandomizedSearchCV(
    xgb, 
    # param_grid=params,
    param_distributions=params, 
    n_iter=3, 
)
digit_search.fit(x_train, y_train)

Will run into the issue:

/Users/anthony/anaconda3/lib/python3.7/site-packages/sklearn/model_selection/_validation.py:536: FitFailedWarning: Estimator fit failed. The score on this train-test partition for these parameters will be set to nan. Details: 
xgboost.core.XGBoostError: value 3 for Parameter colsample_bytree exceed bound [0,1]
colsample_bytree: Subsample ratio of columns, resample on each tree construction.

Interestingly, reordering the parameters will change the error. Removing the max_depth parameter allows it to work, but adding other parameters like n_estimators into the dictionary also causes issues.

Looking at the config dictionary just before running tune.run, the parameters are properly lined up there, so it's currently unclear how this happens and why reordering elements in the dictionary can affect the error deterministically.

[Enhancement] Smart error handling and hyperparameter resampling

I'm trying out Bayesian optimization, but the tuning errors out whenever an incompatible combination of hyperparameters occurs. For example, if I provide the following param_distributions for LogisticRegression:

  1. penalty: [‘l1’, ‘l2’, ‘elasticnet’]
  2. solver: [‘newton-cg’, ‘lbfgs’, ‘liblinear’, ‘sag’, ‘saga’]

it throws an error because elasticnet is only supported by the saga solver and all other combinations are invalid. Is there a way to ignore such cases and just proceed with the tuning by resampling the hyperparameters?

Pickling with PyTorch Module

When using TuneGridSearchCV with a torch.nn.Module, the interface is able to fit, but during fitting if Tune decides to make a checkpoint via pickling, trials cannot complete due to a PicklingError. See examples/torch_nn.py for an example of how this is done.

The class (Module) is dynamically created in Tune, which may be the cause of the error. Or it could simply be the fact that the class has been instantiated. Attempting to pickle an uninitialized Module seems to work fine.

This TuneGridSearchCV instance is not fitted yet.

I have a question: does tune-sklearn only support estimators with 'partial_fit' or 'warm_start' attributes?

When I use RandomForestClassifier as an estimator, the program always reports an error:
This TuneGridSearchCV instance is not fitted yet.

Code:

# from sklearn.model_selection import GridSearchCV
from tune_sklearn import TuneGridSearchCV
from sklearn.ensemble import RandomForestClassifier
# Other imports
import time
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier

# Set training and validation sets
X, y = make_classification(n_samples=11000, n_features=1000, n_informative=50, n_redundant=0, n_classes=10, class_sep=2.5)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1000)

# Example parameters to tune from SGDClassifier
param_grid = {
        'n_estimators': [100,200,300],
        'max_depth':range(5,30,5),
        'min_samples_split':[1,2,5,10,15],
        'min_samples_leaf':[1,2,5,10],
        'max_features': ['log2','sqrt']
    }
forest_clf = RandomForestClassifier(random_state=42,warm_start=True)

grid_search = TuneGridSearchCV(forest_clf, param_grid, cv=5,scoring='accuracy',use_gpu=True)

start=time.time()
grid_search.fit(X_train, y_train)
end=time.time()
print('Tune time: ',end-start)
score=grid_search.score(X_test,y_test)
print("Tune Score:", score)

For evaluating multiple scores, use sklearn.model_selection.cross_validate instead

Reproducible Example

from tune_sklearn import TuneSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn import datasets
from sklearn.model_selection import train_test_split
from scipy.stats import randint
import numpy as np

digits = datasets.load_digits()
x = digits.data
y = digits.target
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.2)

clf = RandomForestClassifier(random_state=317, verbose=100)
param_distributions = {
    "n_estimators": (1, 120),
}

tune_search = TuneSearchCV(
    clf,
    param_distributions,
    scoring=(
        'homogeneity_score',
        'completeness_score',
    ),
    verbose=2,
    search_optimization='bayesian'
)

tune_search.fit(x_train, x_test)

pred = tune_search.predict(y_train)
accuracy = np.count_nonzero(
    np.array(pred) == np.array(y_test)) / len(pred)
print(accuracy)

results in:

/private/tmp/venv/lib/python3.8/site-packages/tune_sklearn/tune_basesearch.py:245: UserWarning: Early stopping is not enabled. To enable early stopping, pass in a supported scheduler from Tune and ensure the estimator has `partial_fit`.
  warnings.warn("Early stopping is not enabled. "
Redis failed to start, retrying now.
Traceback (most recent call last):
  File "pls.py", line 38, in <module>
    tune_search.fit(x_train, x_test)
  File "/private/tmp/venv/lib/python3.8/site-packages/tune_sklearn/tune_basesearch.py", line 368, in fit
    result = self._fit(X, y, groups, **fit_params)
  File "/private/tmp/venv/lib/python3.8/site-packages/tune_sklearn/tune_basesearch.py", line 290, in _fit
    self.scoring = check_scoring(self.estimator, scoring=self.scoring)
  File "/private/tmp/venv/lib/python3.8/site-packages/sklearn/utils/validation.py", line 73, in inner_f
    return f(**kwargs)
  File "/private/tmp/venv/lib/python3.8/site-packages/sklearn/metrics/_scorer.py", line 430, in check_scoring
    raise ValueError("For evaluating multiple scores, use "
ValueError: For evaluating multiple scores, use sklearn.model_selection.cross_validate instead. ('homogeneity_score', 'completeness_score', 'v_measure_score', 'adjusted_rand_score', 'adjusted_mutual_info_score') was passed.

Environment

  • ray 0.8.6
  • sklearn 0.23.1
  • skopt 0.7.4
  • scipy 1.5.1
  • python 3.8.2

Error messages when running random forest example

I have just tried to run your example script for random forest (https://github.com/ray-project/tune-sklearn/blob/master/examples/random_forest.py).

When I run it, I am getting the following errors/warnings:

WARNING:ray.worker:The dashboard on node DESKTOP-KBV1GE9 failed with the following error:
Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\lib\site-packages\ray\dashboard/dashboard.py", line 960, in <module>
    metrics_export_address=metrics_export_address)
  File "C:\ProgramData\Anaconda3\lib\site-packages\ray\dashboard/dashboard.py", line 513, in __init__
    build_dir = setup_static_dir(self.app)
  File "C:\ProgramData\Anaconda3\lib\site-packages\ray\dashboard/dashboard.py", line 414, in setup_static_dir
    "&& npm run build)", build_dir)
FileNotFoundError: [Errno 2] Dashboard build directory not found. If installing from source, please follow the additional steps required to build the dashboard(cd python/ray/dashboard/client && npm ci && npm run build): 'C:\\ProgramData\\Anaconda3\\lib\\site-packages\\ray\\dashboard\\client/build'

ERROR:ray.tune.tune:Trial Runner checkpointing failed.
Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\lib\site-packages\ray\tune\tune.py", line 332, in run
    runner.checkpoint(force=True)
  File "C:\ProgramData\Anaconda3\lib\site-packages\ray\tune\trial_runner.py", line 279, in checkpoint
    os.rename(tmp_file_name, self.checkpoint_file)
FileExistsError: [WinError 183] Cannot create a file when that file already exists: 'C:\\Users\\Mislav\\ray_results\\_Trainable\\.tmp_checkpoint' -> 'C:\\Users\\Mislav\\ray_results\\_Trainable\\experiment_state-2020-07-20_19-42-07.json'
0.9555555555555556

It returns the result, but I am not sure what those error messages mean?

Pass local_dir parameter to search functions

Problem: It doesn't appear possible to pass the local_dir parameter to the search functions in tune-sklearn. This causes the temporary files to be created in the default directory, which in my case is problematic due to file size/count constraints on my home directory.

Proposed solution: Allow the local_dir parameter to be passed to the functions, for example:

tune_search = TuneGridSearchCV(
    pipeline_svc,
    parameters,
    local_dir = '/.../',   # <-----
    max_iters=10, 
)

There might be an alternative method to achieve this that is already implemented, however, I was going by the following part of the Ray documentation: https://docs.ray.io/en/latest/tune/api_docs/logging.html?highlight=ray_results#log-directory

Thanks!

[Bug] LightGBM error while using early stopping

I'm getting the following error while setting early_stopping=True in TuneSearchCV where estimator = LGBMClassifier(early_stopping_rounds=50):

ValueError: For early stopping, at least one dataset and eval metric is required for evaluation

The same works fine for XGBoost. A couple of related questions:

  1. Why isn't it mandatory for the user to set early_stopping_rounds within the estimator while setting early_stopping=True?
  2. Still not sure how max_iters comes into play. When I set n_trials=10 and max_iters=10, it seems to be running 100 trials anyway. How is the early_stopping happening?
  3. Isn't it better to merge early_stopping and early_stopping_rounds like how it is done in LightGBMTunerCV?
  4. If another level of early stopping has to be implemented across trials (mentioned here), will the same ASHA scheduler work?

ValueError: Search optimization must be random or bayesian

Hi, I just tried to run the newly added example as follows and got this error. My system is Windows 10 64-bit; please help.

Example code:

from tune_sklearn import TuneSearchCV
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

digits = datasets.load_digits()
X = digits.data
y = digits.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

space = {
"n_estimators": (100, 200),
"min_weight_fraction_leaf": (0.0, 0.5),
"min_samples_leaf": (1, 5)
}

tune_search = TuneSearchCV(
RandomForestClassifier(),
space,
search_optimization="hyperopt",
n_iter=3,
max_iters=10)
tune_search.fit(X_train, y_train)

print(tune_search.cv_results_)
print(tune_search.best_params_)

Error information:
ValueError Traceback (most recent call last)
in
21 search_optimization="hyperopt",
22 n_iter=3,
---> 23 max_iters=10)
24 tune_search.fit(X_train, y_train)
25

D:\Anaconda\anaconda\envs\ml\lib\site-packages\tune_sklearn\tune_search.py in __init__(self, estimator, param_distributions, early_stopping, n_iter, scoring, n_jobs, sk_n_jobs, refit, cv, verbose, random_state, error_score, return_train_score, local_dir, max_iters, search_optimization, use_gpu)
180
181 if (search_optimization not in ["random", "bayesian"]):
--> 182 raise ValueError("Search optimization must be random or bayesian")
183 if (search_optimization == "bayesian" and random_state is not None):
184 warnings.warn(

ValueError: Search optimization must be random or bayesian
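The pip-installed tune-sklearn here only accepts "random" or "bayesian" for search_optimization; the "hyperopt" option used by the newly added example is only available in a newer version of the package. A minimal sketch of the two workarounds, reusing space, RandomForestClassifier, and the train/test split from the snippet above:

# Option 1: upgrade tune-sklearn (and install hyperopt) so that
# search_optimization="hyperopt" is recognized:
#   pip install -U tune-sklearn hyperopt

# Option 2: fall back to an optimizer this version does accept
# (bayesian search requires scikit-optimize to be installed).
tune_search = TuneSearchCV(
    RandomForestClassifier(),
    space,
    search_optimization="bayesian",
    n_iter=3,
    max_iters=10)
tune_search.fit(X_train, y_train)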

Use `ray.put` for shared memory

Right now, we do N transfers of the same data object. It would be nice to avoid that by using the shared-memory object store via ray.put:

import ray
from ray import tune

# Put the dataset (X = training data) into Ray's shared-memory object store once.
object_id = ray.put(X)

class SKLTrainable(tune.Trainable):
    def _setup(self, config):
        # Each trial retrieves the data by reference instead of receiving its own copy.
        object_id = config["data"]
        new_X = ray.get(object_id)

tune.run(SKLTrainable, config={"data": object_id})

TuneError: Insufficient cluster resources to launch trial: trial requested

I have just tried the TuneSearchCV function with the search_optimization='bayesian' option:

param_bayes = {
    "n_estimators": (50, 1000),
    "max_depth": (2, 7),
    'max_features': (1, 30)
    # 'min_weight_fraction_leaf': (0.03, 0.1, 'uniform')
}

# clf = joblib.load("rf_model.pkl")
rf = RandomForestClassifier(criterion='entropy',
                            class_weight='balanced_subsample',
                            random_state=rand_state)

# tune search
tune_search = TuneSearchCV(
    rf,
    param_bayes,
    search_optimization='bayesian',
    max_iters=10,
    scoring='f1',
    n_jobs=16,
    cv=cv,
    verbose=1
)

tune_search.fit(X_train, y_train, sample_weight=sample_weigths)

I get the following output:

== Status ==
Memory usage on this node: 23.8/31.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 16/32 CPUs, 0/0 GPUs, 0.0/7.28 GiB heap, 0.0/2.49 GiB objects
Result logdir: C:\Users\Mislav\ray_results\_Trainable
Number of trials: 10 (1 ERROR, 9 PENDING)

Trial name | status | loc | max_depth | max_features | n_estimators
-- | -- | -- | -- | -- | --
_Trainable_f5813982 | ERROR |   | 4 | 14 | 54
_Trainable_f5824ae6 | PENDING |   | 6 | 26 | 317
_Trainable_f5835c58 | PENDING |   | 7 | 7 | 863
_Trainable_f5846dcc | PENDING |   | 5 | 8 | 667
_Trainable_f5855824 | PENDING |   | 3 | 20 | 82
_Trainable_f586699c | PENDING |   | 5 | 6 | 516
_Trainable_f5877b08 | PENDING |   | 4 | 29 | 435
_Trainable_f5888c7a | PENDING |   | 4 | 26 | 823
_Trainable_f5899df8 | PENDING |   | 6 | 14 | 68
_Trainable_f58aaf58 | PENDING |   | 6 | 12 | 292

ERROR:ray.tune.ray_trial_executor:Trial _Trainable_f5824ae6: Unexpected error starting runner.
Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\lib\site-packages\ray\tune\ray_trial_executor.py", line 294, in start_trial
    self._start_trial(trial, checkpoint, train=train)
  File "C:\ProgramData\Anaconda3\lib\site-packages\ray\tune\ray_trial_executor.py", line 233, in _start_trial
    runner = self._setup_remote_runner(trial, reuse_allowed)
  File "C:\ProgramData\Anaconda3\lib\site-packages\ray\tune\ray_trial_executor.py", line 129, in _setup_remote_runner
    trial.init_logger()
  File "C:\ProgramData\Anaconda3\lib\site-packages\ray\tune\trial.py", line 318, in init_logger
    self.local_dir)
  File "C:\ProgramData\Anaconda3\lib\site-packages\ray\tune\trial.py", line 310, in create_logdir
    dir=local_dir)
  File "C:\ProgramData\Anaconda3\lib\tempfile.py", line 366, in mkdtemp
    _os.mkdir(file, 0o700)
OSError: [WinError 123] The filename, directory name, or volume label syntax is incorrect: 'C:\\Users\\Mislav\\ray_results\\_Trainable\\_Trainable_2_X_id=ObjectID(ffffffffffffffffffffffff0100008013000000),cv=PurgedKFold(n_splits=4, pct_embargo=0.0,\n      samples_inf_2020-07-21_08-32-35vlc0o8vf'
WARNING:ray.tune.utils.util:The `start_trial` operation took 2.002230405807495 seconds to complete, which may be a performance bottleneck.

---------------------------------------------------------------------------
TuneError                                 Traceback (most recent call last)
c:\Users\Mislav\Documents\GitHub\trademl\trademl\modeling\train_rf_sklearnopt.py in <module>
----> 229 tune_search.fit(X_train, y_train, sample_weight=sample_weigths)

C:\ProgramData\Anaconda3\lib\site-packages\tune_sklearn\tune_basesearch.py in fit(self, X, y, groups, **fit_params)
    366                 ray.init(ignore_reinit_error=True, configure_logging=False)
    367 
--> 368             result = self._fit(X, y, groups, **fit_params)
    369 
    370             if not ray_init and ray.is_initialized():

C:\ProgramData\Anaconda3\lib\site-packages\tune_sklearn\tune_basesearch.py in _fit(self, X, y, groups, **fit_params)
    320 
    321         self._fill_config_hyperparam(config)
--> 322         analysis = self._tune_run(config, resources_per_trial)
    323 
    324         self.cv_results_ = self._format_results(self.n_splits, analysis)

C:\ProgramData\Anaconda3\lib\site-packages\tune_sklearn\tune_search.py in _tune_run(self, config, resources_per_trial)
    337                 fail_fast=True,
    338                 checkpoint_at_end=True,
--> 339                 resources_per_trial=resources_per_trial)
    340 
    341         return analysis

C:\ProgramData\Anaconda3\lib\site-packages\ray\tune\tune.py in run(run_or_experiment, name, stop, config, resources_per_trial, num_samples, local_dir, upload_dir, trial_name_creator, loggers, sync_to_cloud, sync_to_driver, checkpoint_freq, checkpoint_at_end, sync_on_checkpoint, keep_checkpoints_num, checkpoint_score_attr, global_checkpoint_period, export_formats, max_failures, fail_fast, restore, search_alg, scheduler, with_server, server_port, verbose, progress_reporter, resume, queue_trials, reuse_actors, trial_executor, raise_on_failed_trial, return_trials, ray_auto_init)
    325 
    326     while not runner.is_finished():
--> 327         runner.step()
    328         if verbose:
    329             _report_progress(runner, progress_reporter)

C:\ProgramData\Anaconda3\lib\site-packages\ray\tune\trial_runner.py in step(self)
    340             self._process_events()  # blocking
    341         else:
--> 342             self.trial_executor.on_no_available_trials(self)
    343 
    344         self._stop_experiment_if_needed()

C:\ProgramData\Anaconda3\lib\site-packages\ray\tune\trial_executor.py in on_no_available_trials(self, trial_runner)
    173                              self.resource_string(),
    174                              trial.get_trainable_cls().resource_help(
--> 175                                  trial.config)))
    176             elif trial.status == Trial.PAUSED:
    177                 raise TuneError("There are paused trials, but no more pending "

TuneError: Insufficient cluster resources to launch trial: trial requested 16 CPUs, 0 GPUs but the cluster has only 32 CPUs, 0 GPUs, 7.28 GiB heap, 2.49 GiB objects (1.0 node:192.168.1.4). Pass `queue_trials=True` in ray.tune.run() or on the command line to queue trials until the cluster scales up or resources become available. 
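Two things that might be worth experimenting with here, both hedged guesses rather than confirmed fixes: the status output shows each trial requesting 16 CPUs, which matches n_jobs=16 in the snippet above, so lowering n_jobs should shrink the per-trial request; and since tune-sklearn only calls ray.init when Ray is not already running (visible in the traceback), initializing Ray yourself lets you control the resources the scheduler sees. A sketch reusing rf, param_bayes, cv, and the training data from above:

import ray

# Initialize Ray explicitly so the resource pool is under your control;
# tune-sklearn skips its own ray.init when Ray is already initialized.
ray.init(num_cpus=32, ignore_reinit_error=True)

tune_search = TuneSearchCV(
    rf,
    param_bayes,
    search_optimization='bayesian',
    max_iters=10,
    scoring='f1',
    n_jobs=4,  # hedged guess: a smaller per-trial CPU request than 16
    cv=cv,
    verbose=1
)
tune_search.fit(X_train, y_train, sample_weight=sample_weigths)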

Benchmarking

tune-sklearn's fit times don't scale linearly with training data size, while other libraries and sklearn itself do. tune-sklearn appears to slow down considerably on larger datasets, but this may be because other processes were hogging resources on the machine we tested on. We should rerun the benchmarks on an otherwise idle machine to get an accurate picture of fit times.
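A minimal timing sketch along these lines (an assumed setup for illustration, not the benchmark actually used), comparing fit times for the two interfaces across training-set sizes:

import time

from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from tune_sklearn import TuneGridSearchCV

param_grid = {"alpha": [1e-4, 1e-3, 1e-2]}

for n_samples in [1_000, 10_000, 100_000]:
    X, y = make_classification(n_samples=n_samples, n_features=20)
    for name, search in [
        ("sklearn", GridSearchCV(SGDClassifier(), param_grid, cv=3)),
        ("tune-sklearn", TuneGridSearchCV(SGDClassifier(), param_grid, cv=3)),
    ]:
        start = time.time()
        search.fit(X, y)
        print(f"{name}: n={n_samples}, fit took {time.time() - start:.1f}s")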

Multi-metric scoring with non-default `search_optimization` values

When using multi-metric scoring with non-default values for search_optimization in TuneSearchCV, I was getting a KeyError during the refitting step. I've included a code snippet and a screenshot of the error below. I had already successfully installed scikit-optimize and hyperopt.

A code snippet that reproduces this error is:

from tune_sklearn import TuneSearchCV
import time
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

# Set training and validation sets
X, y = make_regression(n_samples=100, n_features=10, n_informative=5)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=20)

# Example parameters to tune from xgb
parameters = {
    "regressor__max_depth": [2, 3, 5, 8],
    "regressor__learning_rate": (0.05, 0.2, 'uniform')
}

tune_search = TuneSearchCV(
    XGBRegressor(random_state=42, objective='reg:squarederror',
                 n_estimators=1000, verbose=False, n_jobs=1),
    parameters,
    n_jobs=1,
    refit='mae',
    search_optimization='bayesian',
    n_trials=2,
    scoring={
        "mae": "neg_mean_absolute_error",
        "rmse": "neg_root_mean_squared_error",
        "r2": "r2",
    }
)
tune_search.fit(X_train, y_train)
end = time.time()  # `time` imported above so this line runs

(Screenshot: tune_sklearn_keyerror)
