deepsignsecurity / lightgbm-rs
This project is forked from vaaaaanquish/lightgbm-rs, an advanced fork of the LightGBM Rust binding.
License: MIT License
Running LightGBM from other language bindings prints per-iteration evaluation scores during training. The important thing is that, in addition to the training dataset, you can specify one or more validation datasets, and every round of training (or every metric_freq rounds in the CLI) you get the current scores of the training and validation sets on each metric. This lets you judge whether training is still making progress and whether overfitting is occurring.
The C API supports this via LGBM_BoosterAddValidData and LGBM_BoosterGetEval; we just need to wire them up in a reasonable way.
The most natural way from the user's point of view would be to specify an array of validation datasets in the booster::train call, even if this requires us to register and unregister them in the booster. We would also have to enhance the return type to include not just the booster but also the validation results. It might also be interesting to support callbacks that are informed of results as they happen, to allow printing results interactively or running smart logic.
Since validation takes time, retaining support for the current mode of not running it at all might be worthwhile (perhaps by just keeping the old API around, although that feels like code duplication).
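One possible shape for such a call, sketched with stub types. All names here (train_with_validation, EvalScore) are assumptions rather than the current API, and the training loop is faked; real code would call LGBM_BoosterUpdateOneIter and LGBM_BoosterGetEval each round:

```rust
pub struct Dataset; // stand-in for the real Dataset
pub struct Booster; // stand-in for the real Booster

/// One (dataset, metric, score) triple, as LGBM_BoosterGetEval reports them.
pub struct EvalScore {
    pub dataset: usize, // 0 = training set, 1.. = validation sets
    pub metric: String,
    pub score: f64,
}

impl Booster {
    /// Trains for `num_rounds`, invoking `on_round` with the scores of the
    /// training set and every validation set after each round. Returns the
    /// booster together with the final scores.
    pub fn train_with_validation(
        _train: &Dataset,
        validation: &[&Dataset],
        num_rounds: u32,
        mut on_round: impl FnMut(u32, &[EvalScore]),
    ) -> (Booster, Vec<EvalScore>) {
        let mut last = Vec::new();
        for round in 0..num_rounds {
            // Placeholder scores; the real implementation queries LightGBM.
            last = (0..=validation.len())
                .map(|i| EvalScore { dataset: i, metric: "l2".to_string(), score: 0.5 })
                .collect();
            on_round(round, &last);
        }
        (Booster, last)
    }
}
```

A callback that returns a value instead of `()` would also give later features, such as early stopping, a place to hook in.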
Currently the booster::train method takes a (serde_)json dictionary of parameters. This isn't idiomatic and makes mistakes easy. It would be nicer to be able to write:

booster.train(dataset, TrainingParameters { num_iterations: 100, metric: Metric::Huber, learning_rate: 0.2, ..Default::default() })
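A minimal sketch of what such a typed parameter struct could look like. The struct and its to_params_string method are hypothetical; only the field names are real LightGBM parameter names. Using Option<T> fields means unset parameters are simply omitted, so LightGBM falls back to its own defaults:

```rust
#[derive(Default)]
pub struct TrainingParameters {
    pub num_iterations: Option<u32>,
    pub learning_rate: Option<f64>,
    pub metric: Option<String>,
}

impl TrainingParameters {
    /// Serializes only the fields that were actually set, in the
    /// `key=value` form the C API accepts.
    pub fn to_params_string(&self) -> String {
        let mut parts = Vec::new();
        if let Some(n) = self.num_iterations {
            parts.push(format!("num_iterations={}", n));
        }
        if let Some(lr) = self.learning_rate {
            parts.push(format!("learning_rate={}", lr));
        }
        if let Some(m) = &self.metric {
            parts.push(format!("metric={}", m));
        }
        parts.join(" ")
    }
}
```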
deepsign-training implements a subset of this. There are two challenges; addressing them calls for Option<T> fields, a builder pattern, or some conversion logic (the route taken by deepsign-training, where parameters are not serialized to JSON when they are invalid).

Currently booster::predict runs on all cores. When you predict one or a few datapoints, this is much slower than single-threaded prediction due to synchronization overhead (not to mention the cost of starting all those threads).
The number of threads can be set in the params string passed to prediction, like this: let params = CString::new("num_threads=1").unwrap();. Currently this option is not exposed at all. An easy fix would be to add a prediction variant that takes a settings object (maybe with a builder pattern, to make it easier to add more options in later versions without breaking compatibility). A more sophisticated version might look at how many data points were given to the predict method and choose a reasonable number of threads based on that; running a couple of benchmarks should make it possible to pick something reasonable. We would have to investigate whether Windows and Unix-likes differ here, though, due to the different overhead of starting threads.
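The size-based heuristic could look something like this. The threshold of roughly one thread per thousand rows is invented for illustration, not benchmarked, and max_threads would come from the machine's core count:

```rust
use std::ffi::CString;

/// Pick a thread count from the number of rows to predict: roughly one
/// thread per 1000 rows, capped at the available cores, never below one.
fn choose_num_threads(n_rows: usize, max_threads: usize) -> usize {
    (n_rows / 1000 + 1).clamp(1, max_threads)
}

/// Build the params string the C API expects for prediction.
fn prediction_params(n_rows: usize, max_threads: usize) -> CString {
    CString::new(format!("num_threads={}", choose_num_threads(n_rows, max_threads))).unwrap()
}
```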
The C API contains an LGBM_BoosterMerge method. I can't find any examples of it being used (except the Python library using it to copy an existing booster into an empty one), but I'm hoping it might be useful for handling ensembles of boosters trained on different datasets.
I'd imagine something of the form pub fn merge_from(&mut self, other: &Booster), implemented on Booster, along with a unit test that trains two simple boosters, merges them, and demonstrates prediction with the merged booster.
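Purely as a shape sketch: everything below is a stub, the real implementation would call LGBM_BoosterMerge on the two handles, and num_trees merely stands in for the real booster state:

```rust
pub struct Booster {
    num_trees: u32, // stand-in for the real booster handle/state
}

impl Booster {
    /// Merge the trees of `other` into `self`.
    /// Real code: unsafe { LGBM_BoosterMerge(self.handle, other.handle) }
    pub fn merge_from(&mut self, other: &Booster) {
        self.num_trees += other.num_trees;
    }
}
```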
The current API has two major problems:

The code in booster.rs and dataset.rs should be rewritten to improve on these points. Additionally, extending functionality should become easier, especially with respect to the currently open issues (#8, #6, #5, #3).
Guidelines to follow:
Normally LightGBM supports early stopping, controlled by the early_stopping_round and first_metric_only parameters. The idea is to stop training once it has made no progress for some number of rounds, and to roll back to the round that produced the best score on the validation data (by calling LGBM_BoosterRollbackOneIter the appropriate number of times). This avoids wasting training time and improves the model by reducing overfitting.
This depends on the metrics of #5.
If #5 implements callbacks for live updates, a minor change allowing the callback to communicate decisions back would make it possible to implement early stopping entirely as a callback. This is the route taken by TensorFlow, and it makes it easy to swap in other early-stopping techniques.
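The bookkeeping such a callback would need is small and self-contained. A sketch, where the struct and its names are hypothetical and a lower-is-better metric such as l2 is assumed:

```rust
/// Tracks the best validation score seen so far and how long ago it occurred.
pub struct EarlyStopper {
    patience: u32, // the early_stopping_round equivalent
    best_score: f64,
    pub best_round: u32,
    rounds_since_best: u32,
}

impl EarlyStopper {
    pub fn new(patience: u32) -> Self {
        Self { patience, best_score: f64::INFINITY, best_round: 0, rounds_since_best: 0 }
    }

    /// Feed one validation score per round (lower is better). Returns false
    /// once `patience` rounds have passed without improvement.
    pub fn keep_going(&mut self, round: u32, score: f64) -> bool {
        if score < self.best_score {
            self.best_score = score;
            self.best_round = round;
            self.rounds_since_best = 0;
        } else {
            self.rounds_since_best += 1;
        }
        self.rounds_since_best < self.patience
    }
}
```

On a false return, the caller would stop training and call LGBM_BoosterRollbackOneIter once per round past best_round.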
In CI, the Ubuntu builds work while the macOS builds fail. It looks like a missing dependency.

Currently CI only tests Linux and macOS. We should add Windows. This should be relatively easy: GitHub has a windows-latest runner, and that runner comes with vcpkg (and Chocolatey) preinstalled as package managers, so getting the necessary dependencies should be straightforward (we can also borrow config from deepsign-client, adapting it to the style of CI used here).