deepsignsecurity / lightgbm-rs
This project is forked from vaaaaanquish/lightgbm-rs, an advanced fork of the LightGBM Rust binding.
License: MIT License
Running LightGBM from other language bindings prints per-iteration evaluation scores during training. The important thing is that, in addition to the training dataset, you can specify one or more validation datasets, and every round of training (or every metric_freq rounds in the CLI) you get the current scores of the training and validation sets on each metric. This lets you judge whether training is still making progress and whether overfitting is occurring.
The C API supports this via LGBM_BoosterAddValidData and LGBM_BoosterGetEval; we just need to wire them up in a reasonable way.
The most natural way from the user's point of view would be to specify an array of validation datasets in the booster::train call, even if this requires us to register and unregister them in the booster. We would also have to enhance the return type to include not just the booster but also the validation results. It might also be interesting to support callbacks that are informed of results as they happen, to allow printing results interactively or running smart logic.
Since validation takes time, retaining support for the current mode of not running it at all might be worthwhile (perhaps by just keeping the old API around, although that feels like code duplication).
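One possible shape for such a call, sketched with stub types. All names here (train_with_validation, EvalScore) are assumptions rather than the current API, and the training loop is faked; real code would call LGBM_BoosterUpdateOneIter and LGBM_BoosterGetEval each round:

```rust
pub struct Dataset; // stand-in for the real Dataset
pub struct Booster; // stand-in for the real Booster

/// One (dataset, metric, score) triple, as LGBM_BoosterGetEval reports them.
pub struct EvalScore {
    pub dataset: usize, // 0 = training set, 1.. = validation sets
    pub metric: String,
    pub score: f64,
}

impl Booster {
    /// Trains for `num_rounds`, invoking `on_round` with the scores of the
    /// training set and every validation set after each round. Returns the
    /// booster together with the final scores.
    pub fn train_with_validation(
        _train: &Dataset,
        validation: &[&Dataset],
        num_rounds: u32,
        mut on_round: impl FnMut(u32, &[EvalScore]),
    ) -> (Booster, Vec<EvalScore>) {
        let mut last = Vec::new();
        for round in 0..num_rounds {
            // Placeholder scores; the real implementation queries LightGBM.
            last = (0..=validation.len())
                .map(|i| EvalScore { dataset: i, metric: "l2".to_string(), score: 0.5 })
                .collect();
            on_round(round, &last);
        }
        (Booster, last)
    }
}
```

A callback that returns a value instead of `()` would also give later features, such as early stopping, a place to hook in.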
Currently the booster::train method takes a (serde_)json dictionary of parameters. This isn't idiomatic and makes mistakes easy. It would be nicer to be able to write:

booster.train(dataset, TrainingParameters { num_iterations: 100, metric: Metric::Huber, learning_rate: 0.2, ..Default::default() })
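A minimal sketch of what such a typed parameter struct could look like. The struct and its to_params_string method are hypothetical; only the field names are real LightGBM parameter names. Using Option<T> fields means unset parameters are simply omitted, so LightGBM falls back to its own defaults:

```rust
#[derive(Default)]
pub struct TrainingParameters {
    pub num_iterations: Option<u32>,
    pub learning_rate: Option<f64>,
    pub metric: Option<String>,
}

impl TrainingParameters {
    /// Serializes only the fields that were actually set, in the
    /// `key=value` form the C API accepts.
    pub fn to_params_string(&self) -> String {
        let mut parts = Vec::new();
        if let Some(n) = self.num_iterations {
            parts.push(format!("num_iterations={}", n));
        }
        if let Some(lr) = self.learning_rate {
            parts.push(format!("learning_rate={}", lr));
        }
        if let Some(m) = &self.metric {
            parts.push(format!("metric={}", m));
        }
        parts.join(" ")
    }
}
```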
deepsign-training implements a subset of this. There are two challenges; addressing them calls for Option<T> fields, a builder pattern, or some conversion logic (the route taken by deepsign-training, where parameters are not serialized to JSON when they are invalid).

Currently booster::predict runs on all cores. When you predict one or a few datapoints, this is much slower than single-threaded prediction due to synchronization overhead (not to mention the cost of starting all those threads).
The number of threads can be set in the params string passed to prediction, like this: let params = CString::new("num_threads=1").unwrap();. Currently this option is not exposed at all. An easy fix would be to add a prediction variant that takes a settings object (maybe with a builder pattern, to make it easier to add more options in later versions without breaking compatibility). A more sophisticated version might look at how many data points were given to the predict method and choose a reasonable number of threads based on that; running a couple of benchmarks should make it possible to pick something reasonable. We would have to investigate whether Windows and Unix-likes differ here, though, due to the different overhead of starting threads.
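The size-based heuristic could look something like this. The threshold of roughly one thread per thousand rows is invented for illustration, not benchmarked, and max_threads would come from the machine's core count:

```rust
use std::ffi::CString;

/// Pick a thread count from the number of rows to predict: roughly one
/// thread per 1000 rows, capped at the available cores, never below one.
fn choose_num_threads(n_rows: usize, max_threads: usize) -> usize {
    (n_rows / 1000 + 1).clamp(1, max_threads)
}

/// Build the params string the C API expects for prediction.
fn prediction_params(n_rows: usize, max_threads: usize) -> CString {
    CString::new(format!("num_threads={}", choose_num_threads(n_rows, max_threads))).unwrap()
}
```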
The C API contains an LGBM_BoosterMerge method. I can't find any examples of it being used (except the Python library using it to copy an existing booster into an empty one), but I'm hoping it might be useful for handling ensembles of boosters trained on different datasets.
I'd imagine something of the form pub fn merge_from(&mut self, other: &Booster), implemented on Booster, along with a unit test that trains two simple boosters, merges them, and demonstrates prediction with the merged booster.
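Purely as a shape sketch: everything below is a stub, the real implementation would call LGBM_BoosterMerge on the two handles, and num_trees merely stands in for the real booster state:

```rust
pub struct Booster {
    num_trees: u32, // stand-in for the real booster handle/state
}

impl Booster {
    /// Merge the trees of `other` into `self`.
    /// Real code: unsafe { LGBM_BoosterMerge(self.handle, other.handle) }
    pub fn merge_from(&mut self, other: &Booster) {
        self.num_trees += other.num_trees;
    }
}
```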
The current API has two major problems:

The code in booster.rs and dataset.rs should be rewritten to improve on these points. Additionally, extending functionality should become easier, especially with respect to the currently open issues (#8, #6, #5, #3).
Guidelines to follow:
Normally LightGBM supports early stopping, controlled by the early_stopping_round and first_metric_only parameters. The idea is to stop training once it has made no progress for some number of rounds, and to roll back to the round that produced the best score on the validation data (by calling LGBM_BoosterRollbackOneIter the appropriate number of times). This avoids wasting training time and improves the model by reducing overfitting.
This depends on the metrics of #5.
If #5 implements callbacks for live updates, a minor change allowing the callback to communicate decisions back would make it possible to implement early stopping entirely as a callback. This is the route taken by TensorFlow, and it makes it easy to swap in other early-stopping techniques.
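The bookkeeping such a callback would need is small and self-contained. A sketch, where the struct and its names are hypothetical and a lower-is-better metric such as l2 is assumed:

```rust
/// Tracks the best validation score seen so far and how long ago it occurred.
pub struct EarlyStopper {
    patience: u32, // the early_stopping_round equivalent
    best_score: f64,
    pub best_round: u32,
    rounds_since_best: u32,
}

impl EarlyStopper {
    pub fn new(patience: u32) -> Self {
        Self { patience, best_score: f64::INFINITY, best_round: 0, rounds_since_best: 0 }
    }

    /// Feed one validation score per round (lower is better). Returns false
    /// once `patience` rounds have passed without improvement.
    pub fn keep_going(&mut self, round: u32, score: f64) -> bool {
        if score < self.best_score {
            self.best_score = score;
            self.best_round = round;
            self.rounds_since_best = 0;
        } else {
            self.rounds_since_best += 1;
        }
        self.rounds_since_best < self.patience
    }
}
```

On a false return, the caller would stop training and call LGBM_BoosterRollbackOneIter once per round past best_round.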
In CI, the Ubuntu builds work while the macOS builds fail. It looks like a missing dependency.

Currently CI only tests Linux and macOS. We should add Windows. This should be relatively easy: GitHub has a windows-latest runner, and that runner comes with vcpkg (and Chocolatey) preinstalled as package managers, so getting the necessary dependencies should be straightforward (we can also borrow config from deepsign-client, adapting it to the style of CI used here).