ray-project / lightgbm_ray

LightGBM on Ray

License: Apache License 2.0

Shell 8.20% Python 91.80%

lightgbm_ray's Issues

"grpc_message":"Received message larger than max

_InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.RESOURCE_EXHAUSTED
details = "Received message larger than max (210771146 vs. 104857600)"
debug_error_string = "{"created":"@1646188124.309444695","description":"Error received from peer ipv4:10.207.183.32:40455","file":"src/core/lib/surface/call.cc","file_line":1074,"grpc_message":"Received message larger than max (210771146 vs. 104857600)","grpc_status":8}"
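For context, the two numbers in the error are byte counts. A quick check of the arithmetic (nothing here is specific to Ray or gRPC) suggests the limit is exactly 100 MiB and the rejected message is roughly twice that size:

```python
# Byte counts copied from the error message above.
message_bytes = 210_771_146
limit_bytes = 104_857_600

MIB = 1024 * 1024
limit_mib = limit_bytes / MIB
message_mib = message_bytes / MIB

print(f"limit:   {limit_mib:.1f} MiB")
print(f"message: {message_mib:.1f} MiB")
print(f"ratio:   {message_bytes / limit_bytes:.2f}x")
```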

Support for CrossValidation: Enhancement Request

I am using RayDP with Spark, and I use this package with Ray Tune for hyperparameter optimization of the LightGBM regressor. Unless I'm missing something, there is no way to use LightGBM's native cross-validation as in Ray's examples. Supporting it would be a huge help to model accuracy when training large models.
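Until something like this is supported, one possible workaround is to generate the folds by hand and train one model per fold. The sketch below only produces the fold indices in plain Python; `kfold_indices` is a hypothetical helper, not part of lightgbm_ray, and the per-fold training step is left as a comment.

```python
import random


def kfold_indices(n_rows, n_folds, seed=0):
    """Yield (train_idx, valid_idx) index lists, one pair per fold."""
    if not 2 <= n_folds <= n_rows:
        raise ValueError("n_folds must be between 2 and n_rows")
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Spread the remainder over the first folds so sizes differ by at most 1.
    base, extra = divmod(n_rows, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        valid = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        start += size
        yield train, valid


for train_idx, valid_idx in kfold_indices(n_rows=10, n_folds=3):
    # Hypothetical next step: slice the data with these indices, wrap each
    # part in a RayDMatrix, and call lightgbm_ray.train once per fold.
    print(len(train_idx), len(valid_idx))
```

Averaging the per-fold validation metric then gives a single score that a Ray Tune trial could report.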

How to load LIBSVM files?

Hi, is it possible to load LIBSVM files using the RayDMatrix class?
Thanks
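Whether RayDMatrix can read LIBSVM files directly is not confirmed in this thread. One option might be to parse the file into dense arrays first; scikit-learn's `load_svmlight_file` does this for real files. The hand-written parser below is only an illustration of the format, and it assumes 1-based feature indices and no `qid:` fields; `parse_libsvm_lines` is a hypothetical helper name.

```python
def parse_libsvm_lines(lines, n_features):
    """Parse LIBSVM-formatted lines into dense rows and labels.

    Assumes 1-based feature indices and no `qid:` tokens.
    """
    rows, labels = [], []
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        parts = line.split()
        labels.append(float(parts[0]))
        row = [0.0] * n_features
        for item in parts[1:]:
            index, value = item.split(":", 1)
            row[int(index) - 1] = float(value)
        rows.append(row)
    return rows, labels


rows, labels = parse_libsvm_lines(["1 1:0.5 3:2.0", "0 2:1.5"], n_features=3)
```

After conversion to NumPy arrays, the features and labels could presumably be passed as `RayDMatrix(features, labels)`, in the same way other issues on this page pass in-memory arrays.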

Unintuitive naming of RayDMatrix class

I've just realized that RayDMatrix class in lightgbm_ray is named xgboost_ray.matrix.RayDMatrix, rather than lightgbm_ray.matrix.RayDMatrix.

I understand code re-use, but name re-use? It is apparently by design (as per docs), but in my opinion it violates the principle of least astonishment and thus should be changed to a more intuitive lightgbm_ray.matrix.RayDMatrix.

To reproduce:

from lightgbm_ray import RayParams as test1, RayDMatrix as test2
print([test1, test2])

Output:

[<class 'lightgbm_ray.main.RayParams'>, <class 'xgboost_ray.matrix.RayDMatrix'>]

Ray lightgbm reproducibility issue

@Yard1 Hi, I am trying LightGBM-Ray on a large dataset with 3 actors and 3 CPUs per actor. With this setup, the results keep changing across runs. Can you advise how to make results reproducible in LightGBM-Ray?

I have set the following seeds:

Lightgbm random state seed

import numpy as np
np.random.seed(seed)

import random as python_random
python_random.seed(seed)

Are there any more seeds or parameters to set?
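Besides the seeds listed above, LightGBM itself has a `deterministic` parameter, and its documentation suggests combining it with `force_row_wise` or `force_col_wise`. The sketch below only assembles such a parameter dict; `reproducible_lgbm_params` is a hypothetical helper. Whether these settings are enough for distributed training is an open assumption, since results may also depend on how the data is sharded across actors.

```python
def reproducible_lgbm_params(seed, base_params=None):
    """Return a copy of base_params with LightGBM determinism settings added."""
    params = dict(base_params or {})
    params.update({
        # Master seed; LightGBM uses it to generate its other internal seeds.
        "seed": seed,
        # Ask LightGBM for stable results given the same data and parameters.
        "deterministic": True,
        # Pin the histogram-building strategy instead of letting it be auto-chosen.
        "force_row_wise": True,
    })
    return params


params = reproducible_lgbm_params(42, {"objective": "regression"})
print(params)
```

Keeping the number of actors and CPUs per actor fixed between runs is probably also necessary for comparable results.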

Job supervisor actor -> subprocess fate sharing

import RayLGBMClassifier error

I'm getting the following error when I try to import RayLGBMClassifier.

Traceback (most recent call last):
  File "/home/mforeman/miniconda3/envs/rapids-23.06/classifiers4.py", line 14, in <module>
    from lightgbm_ray import RayLGBMClassifier
  File "/home/mforeman/miniconda3/envs/rapids-23.06/lib/python3.10/site-packages/lightgbm_ray/__init__.py", line 1, in <module>
    from lightgbm_ray.main import RayParams, train, predict
  File "/home/mforeman/miniconda3/envs/rapids-23.06/lib/python3.10/site-packages/lightgbm_ray/main.py", line 55, in <module>
    from xgboost_ray.main import (
  File "/home/mforeman/miniconda3/envs/rapids-23.06/lib/python3.10/site-packages/xgboost_ray/__init__.py", line 1, in <module>
    from xgboost_ray.main import RayParams, predict, train
  File "/home/mforeman/miniconda3/envs/rapids-23.06/lib/python3.10/site-packages/xgboost_ray/main.py", line 76, in <module>
    from xgboost_ray.matrix import (
  File "/home/mforeman/miniconda3/envs/rapids-23.06/lib/python3.10/site-packages/xgboost_ray/matrix.py", line 36, in <module>
    from ray.data.dataset import Dataset as RayDataset
  File "/home/mforeman/miniconda3/envs/rapids-23.06/lib/python3.10/site-packages/ray/data/__init__.py", line 5, in <module>
    from ray.data._internal.compute import ActorPoolStrategy
  File "/home/mforeman/miniconda3/envs/rapids-23.06/lib/python3.10/site-packages/ray/data/_internal/compute.py", line 8, in <module>
    from ray.data._internal.delegating_block_builder import DelegatingBlockBuilder
  File "/home/mforeman/miniconda3/envs/rapids-23.06/lib/python3.10/site-packages/ray/data/_internal/delegating_block_builder.py", line 4, in <module>
    from ray.data._internal.arrow_block import ArrowBlockBuilder
  File "/home/mforeman/miniconda3/envs/rapids-23.06/lib/python3.10/site-packages/ray/data/_internal/arrow_block.py", line 22, in <module>
    from ray.data._internal.numpy_support import (
  File "/home/mforeman/miniconda3/envs/rapids-23.06/lib/python3.10/site-packages/ray/data/_internal/numpy_support.py", line 5, in <module>
    from ray.air.util.tensor_extensions.utils import create_ragged_ndarray
  File "/home/mforeman/miniconda3/envs/rapids-23.06/lib/python3.10/site-packages/ray/air/__init__.py", line 1, in <module>
    from ray.air.checkpoint import Checkpoint
  File "/home/mforeman/miniconda3/envs/rapids-23.06/lib/python3.10/site-packages/ray/air/checkpoint.py", line 22, in <module>
    from ray.air._internal.remote_storage import (
  File "/home/mforeman/miniconda3/envs/rapids-23.06/lib/python3.10/site-packages/ray/air/_internal/remote_storage.py", line 142, in <module>
    _cached_fs: Dict[tuple, Tuple[float, pyarrow.fs.FileSystem]] = {}
AttributeError: 'NoneType' object has no attribute 'fs'
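The last frame suggests that the name `pyarrow` resolved to None inside Ray, which would be consistent with the `pyarrow` import having failed in that environment. That is an assumption based on the traceback, not a confirmed diagnosis. A small check like the following can show whether a module imports cleanly and, if not, why; `can_import` is a hypothetical helper.

```python
import importlib


def can_import(module_name):
    """Return (True, None) if the module imports, else (False, error text)."""
    try:
        importlib.import_module(module_name)
    except Exception as exc:  # broad on purpose: binary wheels can fail in many ways
        return False, f"{type(exc).__name__}: {exc}"
    return True, None


for name in ("pyarrow", "pyarrow.fs"):
    ok, error = can_import(name)
    print(name, "OK" if ok else error)
```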

Support elastic training

where are the examples

The example folder is empty and the links to these examples are all broken. Please provide an updated link to the examples, thank you!

Interaction constraints not working

Hi,

I've been testing the interaction_constraints parameter from lightgbm.

Unfortunately, passing in the list of constraints causes training to fail with a SIGSEGV error.

Example:

#%%
# set up and load boston data
import numpy as np
import pandas as pd
from lightgbm_ray import RayLGBMRegressor, RayParams, RayDMatrix
from sklearn.datasets import load_boston
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
import ray
import os

boston = load_boston()
x, y = boston.data, boston.target
df = pd.DataFrame(x, columns= boston.feature_names)

# make into dmatrix
train_df_with_target = df.copy()
train_df_with_target['target'] = y

train_set = RayDMatrix(
    data=train_df_with_target,
    label = 'target'
    )

# set params and ray params
params = {
    'boosting_type': 'goss',
    'objective': 'regression',
    'metric': 'rmse',
    'num_leaves': 10,
    'max_depth': 4,
    'learning_rate': 0.05,
    'verbose': 1
}

ray_params = RayParams(
    num_actors=2,
    cpus_per_actor = 2,
    )



#%% set up constraint (age cannot interact with any other feature)
constrained_feature = 'AGE'
# column indices of every feature except the constrained one
other_features = [i for i, col in enumerate(df.columns) if col != constrained_feature]
# column index of the constrained feature, as a single-element list
constrained_feature_idx = [i for i, col in enumerate(df.columns) if col == constrained_feature]

constraint = [constrained_feature_idx, other_features]


#%% fit model
mod_ray_constrained = RayLGBMRegressor(
    random_state=100,
    interaction_constraints = constraint,
    **params
)


mod_ray_constrained.fit(train_set,
    y='target',
    eval_set = [(train_set, 'target')],
    eval_names=["train"],
    ray_params=ray_params)

The constrained model fit returns the error:

(_RemoteRayLightGBMActor pid=266, ip=10.99.13.194) *** SIGSEGV received at time=1673976819 on cpu 3 ***
(_RemoteRayLightGBMActor pid=266, ip=10.99.13.194) PC: @ 0x7fedc74926f7 (unknown) (unknown)
(_RemoteRayLightGBMActor pid=266, ip=10.99.13.194) @ 0x7fedc750d420 (unknown) (unknown)
(_RemoteRayLightGBMActor pid=266, ip=10.99.13.194) [2023-01-17 09:33:39,994 E 266 292] logging.cc:361: *** SIGSEGV received at time=1673976819 on cpu 3 ***
(_RemoteRayLightGBMActor pid=266, ip=10.99.13.194) [2023-01-17 09:33:39,994 E 266 292] logging.cc:361: PC: @ 0x7fedc74926f7 (unknown) (unknown)
(_RemoteRayLightGBMActor pid=266, ip=10.99.13.194) [2023-01-17 09:33:39,994 E 266 292] logging.cc:361: @ 0x7fedc750d420 (unknown) (unknown)
(_RemoteRayLightGBMActor pid=266, ip=10.99.13.194) Fatal Python error: Segmentation fault

Running this code with non-distributed lightgbm works fine, as does the above code with interaction constraints removed.

Fix client tests being flaky (timing out)

Client tests in test_end_to_end.py often time out during GitHub Actions CI (though not always). This should be fixed.

They don't seem to time out locally. My guess is that it's due to fewer cores being available on the CI runner.

Ray Tune custom callback based on model structure

I have some code that uses a callback to stop a Ray Tune trial if the complexity of the model (total leaves in the model) exceeds a given threshold. This works fine with a normal lightgbm model but fails when I use a lightgbm_ray model.

In the code below, "use_distributed" can be toggled to True to reproduce the error.

I presume the error is because the correct way of passing the metrics back to tune is with the TuneReportCheckpointCallback() from ray.tune.integration.lightgbm. I've played around with this, but it seems like I can only access the metrics reported by the lightgbm model. I can't add the "total_leaves" as a metric because it relies on accessing the model itself, not just the data and predictions.

Is it possible to report total_leaves to ray tune with lightgbm_ray?

#%%
# set up and load boston data
import numpy as np
import pandas as pd
import os
import lightgbm
from lightgbm_ray import RayLGBMRegressor, RayParams, RayDMatrix
from sklearn.datasets import load_boston
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
import ray
from ray.air import session
from ray import tune
from ray.tune.search.optuna import OptunaSearch


ray.shutdown()
## Initialise ray:
if not ray.is_initialized():
    service_host = os.environ['RAY_HEAD_SERVICE_HOST']
    service_port = os.environ['RAY_HEAD_SERVICE_PORT']
    ray.init(
        f'ray://{service_host}:{service_port}'
    )

use_distributed = False
out_dir = '/path/to/output_folder'  # placeholder: replace with your output folder

boston = load_boston()
x, y = boston.data, boston.target
df = pd.DataFrame(x, columns= boston.feature_names)

# make into dmatrix
if use_distributed:
    actors = 2
    ray_params = RayParams(
        num_actors= actors,
        cpus_per_actor = 2,
    )


    train_df_with_target = df.copy()
    train_df_with_target['target'] = y

    train_set = RayDMatrix(
        data=train_df_with_target,
        label = 'target'
        )
else:
    actors = 1
    

# set params and ray params
params = {
    'boosting_type': 'goss',
    'objective': 'regression',
    'metric': 'rmse',
    'n_estimators':100,
    'num_leaves': 6,
    'max_depth': 3,
    'learning_rate': tune.quniform(0.05,0.1, 0.01),
    'verbose': 1
}




#%% define function to count total leaves in model
def leaves_callback(env):
    model = env.model

    mod_dump = model.dump_model()
    tree_info = mod_dump['tree_info']
    num_leaves = 0
    num_iterations = 0
    for tree in tree_info:
        num_leaves += tree['num_leaves']
        num_iterations += 1

    session.report({'total_leaves': num_leaves,
                    "rmse_train":  env.evaluation_result_list[0][2],
                    'num_iterations': num_iterations})

# define trainable
def trainable(params):
    if use_distributed:
        mod_ray = RayLGBMRegressor(
            random_state=100,
            **params
        )


        mod_ray.fit(train_set,
            y='target',
            eval_set = [(train_set, 'target')],
            eval_names=["train"],
            ray_params=ray_params,
            callbacks = [leaves_callback])
    else:
        mod = lightgbm.LGBMRegressor(
            random_state=100,
            **params
        )

        mod.fit(X = x,
            y=y,
            eval_set = [(x, y)],
            eval_names=["train"],
            callbacks = [leaves_callback])


#%% RUN TUNING

resources = [{'CPU': 2.0} for x in range(actors+1)] + [{'CPU': 1.0}]

analysis = tune.Tuner(
    tune.with_resources(
            trainable,
            tune.PlacementGroupFactory(
                resources,
                strategy='PACK')
        ),
    tune_config=tune.TuneConfig(
        metric="rmse_train",
        mode= "min",
        search_alg=OptunaSearch(),
        num_samples=5),
        
    run_config= ray.air.RunConfig(local_dir=out_dir,
                                name = 'test_callback',
                                stop = {'total_leaves': 300}),
    param_space= params,     
    )


results = analysis.fit()

If I toggle use_distributed to True, I get the following error:

(_RemoteRayLightGBMActor pid=585, ip=10.99.15.76) File "/opt/conda/lib/python3.9/site-packages/ray/air/session.py", line 61, in report
(_RemoteRayLightGBMActor pid=585, ip=10.99.15.76) _get_session().report(metrics, checkpoint=checkpoint)
(_RemoteRayLightGBMActor pid=585, ip=10.99.15.76) AttributeError: 'NoneType' object has no attribute 'report'

If I toggle use_distributed to False, I get the expected result:

(TunerInternal pid=2096) +--------------------+------------+-----------------+-----------------+--------+------------------+----------------+--------------+------------------+
(TunerInternal pid=2096) | Trial name         | status     | loc             |   learning_rate |   iter |   total time (s) |   total_leaves |   rmse_train |   num_iterations |
(TunerInternal pid=2096) |--------------------+------------+-----------------+-----------------+--------+------------------+----------------+--------------+------------------|
(TunerInternal pid=2096) | trainable_a895fa72 | TERMINATED | 10.99.5.8:2131  |            0.05 |     56 |          0.24444 |            300 |      3.44845 |               56 |
(TunerInternal pid=2096) | trainable_aa0be088 | TERMINATED | 10.99.5.8:2131  |             0.1 |     61 |         0.296924 |            302 |      2.82896 |               61 |
(TunerInternal pid=2096) | trainable_aa3ab41c | TERMINATED | 10.99.5.8:2131  |            0.08 |     60 |         0.354107 |            301 |      2.89081 |               60 |
(TunerInternal pid=2096) | trainable_aa6d4d32 | TERMINATED | 10.99.15.76:749 |            0.07 |     59 |         0.310418 |            300 |      2.99355 |               59 |
(TunerInternal pid=2096) | trainable_aa89c7a0 | TERMINATED | 10.99.5.8:2131  |            0.05 |     56 |         0.265122 |            300 |      3.44845 |               56 |
(TunerInternal pid=2096) +--------------------+------------+-----------------+-----------------+--------+------------------+----------------+--------------+------------------+
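The distributed traceback shows the callback running inside a `_RemoteRayLightGBMActor`, where `_get_session()` returned None, so `session.report` has no Tune session to report to. One possible workaround, not confirmed by the maintainers, is to compute the leaf count from the fitted model inside the trainable itself and report it from there. The counting step is independent of Ray and can be sketched on the dict that `Booster.dump_model()` returns; `count_total_leaves` is a hypothetical helper and the sample dict is made up.

```python
def count_total_leaves(model_dump):
    """Sum `num_leaves` over all trees in a `Booster.dump_model()` dict."""
    return sum(tree["num_leaves"] for tree in model_dump["tree_info"])


# Stand-in for the dict returned by `Booster.dump_model()`; values are made up.
fake_dump = {"tree_info": [{"num_leaves": 6}, {"num_leaves": 5}, {"num_leaves": 6}]}
total_leaves = count_total_leaves(fake_dump)
print(total_leaves)
```

The trade-off is that reporting after `fit` returns gives one value per trial instead of one per boosting iteration, so the `stop` criterion could no longer cut a trial short mid-training.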

Support early stopping

Error when running example

Setup

conda create --name lgbm python=3.8
conda activate lgbm
conda install lightgbm
pip install lightgbm_ray

Script:

# light_ray.py
from lightgbm_ray import RayDMatrix, RayParams, train
from sklearn.datasets import load_breast_cancer

train_x, train_y = load_breast_cancer(return_X_y=True)
train_set = RayDMatrix(train_x, train_y)

evals_result = {}
bst = train(
    {
        "objective": "binary",
        "metric": ["binary_logloss", "binary_error"],
    },
    train_set,
    num_boost_round=10,
    evals_result=evals_result,
    valid_sets=[train_set],
    valid_names=["train"],
    verbose_eval=False,
    ray_params=RayParams(num_actors=2, cpus_per_actor=2))


bst.booster_.save_model("model.lgbm")

Exception:

% python light_ray.py 
Traceback (most recent call last):
  File "light_ray.py", line 1, in <module>
    from lightgbm_ray import RayDMatrix, RayParams, train
  File "/Users/will/opt/anaconda3/envs/lgbm/lib/python3.8/site-packages/lightgbm_ray/__init__.py", line 1, in <module>
    from lightgbm_ray.main import RayParams, train, predict
  File "/Users/will/opt/anaconda3/envs/lgbm/lib/python3.8/site-packages/lightgbm_ray/main.py", line 44, in <module>
    from lightgbm import LGBMModel, LGBMRanker, Booster
  File "/Users/will/opt/anaconda3/envs/lgbm/lib/python3.8/site-packages/lightgbm/__init__.py", line 8, in <module>
    from .basic import Booster, Dataset, register_logger
  File "/Users/will/opt/anaconda3/envs/lgbm/lib/python3.8/site-packages/lightgbm/basic.py", line 95, in <module>
    _LIB = _load_lib()
  File "/Users/will/opt/anaconda3/envs/lgbm/lib/python3.8/site-packages/lightgbm/basic.py", line 86, in _load_lib
    lib = ctypes.cdll.LoadLibrary(lib_path[0])
  File "/Users/will/opt/anaconda3/envs/lgbm/lib/python3.8/ctypes/__init__.py", line 459, in LoadLibrary
    return self._dlltype(name)
  File "/Users/will/opt/anaconda3/envs/lgbm/lib/python3.8/ctypes/__init__.py", line 381, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: dlopen(/Users/will/opt/anaconda3/envs/lgbm/lib/python3.8/site-packages/lightgbm/lib_lightgbm.so, 6): Library not loaded: /usr/local/opt/libomp/lib/libomp.dylib
  Referenced from: /Users/will/opt/anaconda3/envs/lgbm/lib/python3.8/site-packages/lightgbm/lib_lightgbm.so
  Reason: image not found