
lightgbm-rs's People

Contributors

abhizer, benjaminjellis, geohardtke, leofidus, paq, vaaaaanquish


lightgbm-rs's Issues

Allow adding validation data, return metrics

Running LightGBM in other languages produces output along these lines:

[image: per-iteration training and validation metric log]

The important thing here is that in addition to the training dataset you can specify one (or more) validation datasets, and every round of training (or every metric_freq rounds in the CLI) you get the current scores of the training and validation sets on each metric. This lets you judge whether training is still making progress and whether overfitting is occurring.

The C API supports this via LGBM_BoosterAddValidData and LGBM_BoosterGetEval; we just need to wire them up in a reasonable way.

The most natural approach from the user's point of view would be to accept an array of validation datasets in the booster::train call, even if this requires us to register and unregister them in the booster. We would also have to extend the return type to include not just the booster but also the validation results. It might additionally be worth supporting callbacks that are informed of results as they happen, to allow printing results interactively or running smarter logic.

Since validation does take time, it might be worth retaining support for the current mode of not running it at all (perhaps by just keeping the old API around, though that feels like code duplication).
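To make the proposal concrete, here is a rough sketch of what the extended training call with validation sets and a metrics callback could look like. Every name here (EvalResult, train_with_validation) is hypothetical, and the stub types fabricate scores instead of calling LGBM_BoosterUpdateOneIter / LGBM_BoosterGetEval:

```rust
/// Evaluation result for one dataset/metric pair at one iteration.
#[derive(Debug, Clone)]
pub struct EvalResult {
    pub dataset_name: String,
    pub metric_name: String,
    pub score: f64,
}

/// Stand-ins for the real Dataset and Booster types.
pub struct Dataset;
pub struct Booster;

impl Booster {
    /// Train with optional named validation sets; the callback is invoked
    /// once per iteration with the current scores, and the full history is
    /// returned alongside the booster.
    pub fn train_with_validation<F>(
        _train: &Dataset,
        valid: &[(&str, &Dataset)],
        num_iterations: usize,
        mut on_eval: F,
    ) -> (Booster, Vec<EvalResult>)
    where
        F: FnMut(usize, &[EvalResult]),
    {
        let mut history = Vec::new();
        for iter in 0..num_iterations {
            // A real implementation would call LGBM_BoosterUpdateOneIter and
            // LGBM_BoosterGetEval here; this sketch fabricates a falling loss.
            let results: Vec<EvalResult> = valid
                .iter()
                .map(|(name, _)| EvalResult {
                    dataset_name: name.to_string(),
                    metric_name: "l2".to_string(),
                    score: 1.0 / (iter as f64 + 1.0),
                })
                .collect();
            on_eval(iter, &results);
            history.extend(results);
        }
        (Booster, history)
    }
}

fn main() {
    let train = Dataset;
    let valid = Dataset;
    let (_booster, history) =
        Booster::train_with_validation(&train, &[("valid_0", &valid)], 3, |iter, results| {
            for r in results {
                println!("[{iter}] {} {}: {:.4}", r.dataset_name, r.metric_name, r.score);
            }
        });
    assert_eq!(history.len(), 3);
}
```

The callback shape is what would later allow early stopping or live progress printing to be layered on without further API changes.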

Make struct for training parameters

Currently the booster::train method takes a (serde_)json dictionary of parameters. This isn't idiomatic and makes mistakes easy. It would be nicer to be able to write

booster.train(dataset, TrainingParameters { num_iterations: 100, metric: Metric::Huber, learning_rate: 0.2, ..Default::default() })

deepsign-training implements a subset of this. There are two challenges:

  • Setting something to its default value is not the same as not setting it at all. The most notable examples are parameters that are only valid in combination with other parameters. That means either liberal application of Option<T>, a builder pattern, or some conversion logic (the route taken by deepsign-training, where parameters are not serialized to JSON when they are invalid).
  • There are a lot of parameters: https://lightgbm.readthedocs.io/en/latest/Parameters.html
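A small self-contained sketch of the Option<T>-plus-builder route mentioned above (the struct, its fields, and to_param_string are illustrative, not existing crate API). Unset parameters stay None and are simply omitted from the rendered parameter string, so "unset" and "explicitly set to the default" remain distinct:

```rust
// Hypothetical parameter struct: only a few of LightGBM's many parameters.
#[derive(Default)]
pub struct TrainingParameters {
    pub num_iterations: Option<u32>,
    pub learning_rate: Option<f64>,
    pub metric: Option<String>,
}

impl TrainingParameters {
    pub fn builder() -> TrainingParametersBuilder {
        TrainingParametersBuilder(TrainingParameters::default())
    }

    /// Render only the explicitly-set parameters, LightGBM key=value style.
    pub fn to_param_string(&self) -> String {
        let mut parts = Vec::new();
        if let Some(n) = self.num_iterations {
            parts.push(format!("num_iterations={n}"));
        }
        if let Some(lr) = self.learning_rate {
            parts.push(format!("learning_rate={lr}"));
        }
        if let Some(m) = &self.metric {
            parts.push(format!("metric={m}"));
        }
        parts.join(" ")
    }
}

pub struct TrainingParametersBuilder(TrainingParameters);

impl TrainingParametersBuilder {
    pub fn num_iterations(mut self, n: u32) -> Self {
        self.0.num_iterations = Some(n);
        self
    }
    pub fn learning_rate(mut self, lr: f64) -> Self {
        self.0.learning_rate = Some(lr);
        self
    }
    pub fn metric(mut self, m: &str) -> Self {
        self.0.metric = Some(m.to_string());
        self
    }
    pub fn build(self) -> TrainingParameters {
        self.0
    }
}

fn main() {
    let params = TrainingParameters::builder()
        .num_iterations(100)
        .learning_rate(0.2)
        .build();
    // metric was never set, so it does not appear in the output at all.
    assert_eq!(params.to_param_string(), "num_iterations=100 learning_rate=0.2");
}
```

The builder also leaves room for validating parameter combinations in build() before anything reaches the C API.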

Be smarter about multithreading when predicting

Currently booster::predict runs on all cores. When you predict one or a few data points, this is much slower than single-threaded prediction due to synchronization overhead (not to mention the cost of starting all those threads).

The number of threads can be changed by setting it in the params argument at prediction time, like this: let params = CString::new("num_threads=1").unwrap();. Currently this option is not exposed at all. An easy fix would be to add a prediction variant that takes a settings object (maybe with a builder pattern, to make it easier to add more options in later versions without breaking compatibility). A more sophisticated version might look at how many data points were passed to the predict method and choose a reasonable number of threads based on that. It should be possible to do something reasonable by running a couple of benchmarks. We would have to investigate whether there is a difference between Windows and Unix-likes, though, due to the different overhead of starting threads.
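The "sophisticated version" could be as simple as a heuristic like the following sketch. The threshold of 1024 rows per thread is invented for illustration; the real break-even point would have to come from the benchmarks mentioned above:

```rust
/// Pick a thread count for prediction based on batch size (illustrative
/// heuristic, not part of the current crate).
fn choose_num_threads(num_rows: usize, available_cores: usize) -> usize {
    // Assumption: roughly 1024 rows per thread amortizes the synchronization
    // overhead. One thread for tiny batches, capped at the available cores.
    const ROWS_PER_THREAD: usize = 1024;
    let wanted = (num_rows + ROWS_PER_THREAD - 1) / ROWS_PER_THREAD;
    wanted.clamp(1, available_cores.max(1))
}

fn main() {
    assert_eq!(choose_num_threads(1, 8), 1); // single point: no threading
    assert_eq!(choose_num_threads(4096, 8), 4); // medium batch: a few threads
    assert_eq!(choose_num_threads(1_000_000, 8), 8); // large batch: all cores
}
```

The result would then be formatted into the existing params string (e.g. format!("num_threads={n}")) before the FFI call.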

Add method to merge boosters

The C API contains an LGBM_BoosterMerge method. I can't find any examples of it being used (except the Python library using it to copy an existing booster into an empty one), but I'm hoping it might be useful for handling ensembles of boosters trained on different datasets.

I'd imagine something of the form pub fn merge_from(&mut self, other: &Booster), implemented on Booster, along with a unit test that trains two simple boosters, merges them, and demonstrates prediction with the merged booster.
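A minimal sketch of that signature, using a stub Booster that just tracks tree names (the real version would forward to LGBM_BoosterMerge through the FFI and check the returned status code):

```rust
/// Stub Booster: the Vec stands in for the trees the native handle would own.
pub struct Booster {
    trees: Vec<&'static str>,
}

impl Booster {
    pub fn new(trees: Vec<&'static str>) -> Self {
        Booster { trees }
    }

    /// Proposed API: absorb the trees of `other` into `self`.
    /// A real implementation would call LGBM_BoosterMerge here.
    pub fn merge_from(&mut self, other: &Booster) {
        self.trees.extend(other.trees.iter().copied());
    }
}

fn main() {
    let mut a = Booster::new(vec!["tree_a0", "tree_a1"]);
    let b = Booster::new(vec!["tree_b0"]);
    a.merge_from(&b);
    assert_eq!(a.trees.len(), 3);
}
```

Taking &Booster rather than consuming the other booster matches the copy-style use seen in the Python library.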

API Rewrite

The current API has three major problems:

  • It doesn't follow ML code style conventions.
  • It implements only just enough of the C FFI to get a proof-of-concept fit/predict pipeline working.
  • There is no idiomatic way to handle optional parameters or multiple validation sets.

The code in booster.rs and dataset.rs should be rewritten to address these points. Additionally, extending the functionality should become easier, especially with respect to the currently open issues (#8, #6, #5, #3).

Guidelines to follow:

  • Use builder pattern for Booster and Dataset
  • Change function names to fall more in line with popular ML frameworks like scikit-learn
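The guidelines above could yield a call-site shape like the following sketch. Every name is illustrative, not the current crate API, and the "model" is a toy that predicts the training-label mean; it only demonstrates the builder construction and the sklearn-style fit/predict verbs:

```rust
struct Dataset {
    features: Vec<Vec<f64>>,
    labels: Vec<f64>,
}

struct DatasetBuilder {
    features: Vec<Vec<f64>>,
    labels: Vec<f64>,
}

impl Dataset {
    fn builder() -> DatasetBuilder {
        DatasetBuilder { features: Vec::new(), labels: Vec::new() }
    }
}

impl DatasetBuilder {
    fn row(mut self, features: Vec<f64>, label: f64) -> Self {
        self.features.push(features);
        self.labels.push(label);
        self
    }
    fn build(self) -> Dataset {
        Dataset { features: self.features, labels: self.labels }
    }
}

struct Booster {
    mean_label: f64, // toy "model": predicts the training-label mean
}

impl Booster {
    /// sklearn-style verb instead of `train`.
    fn fit(data: &Dataset) -> Booster {
        let mean = data.labels.iter().sum::<f64>() / data.labels.len() as f64;
        Booster { mean_label: mean }
    }

    fn predict(&self, features: &[Vec<f64>]) -> Vec<f64> {
        features.iter().map(|_| self.mean_label).collect()
    }
}

fn main() {
    let data = Dataset::builder()
        .row(vec![1.0, 2.0], 0.0)
        .row(vec![3.0, 4.0], 1.0)
        .build();
    let model = Booster::fit(&data);
    let preds = model.predict(&data.features);
    assert_eq!(preds, vec![0.5, 0.5]);
}
```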

Add early stopping

Normally LightGBM supports early stopping, controlled by the early_stopping_round and first_metric_only parameters. The idea is basically to stop training once the validation score stops improving for some number of rounds, and to roll back to the round that produced the best score on the validation data (by calling LGBM_BoosterRollbackOneIter the appropriate number of times). This avoids wasting training time and improves the model by reducing overfitting.

This depends on the metrics of #5.

If #5 implements callbacks for live updates, a minor change to allow the callback to communicate back some decisions would make it possible to implement early stopping entirely as a callback. This is the route taken by tensorflow, and makes it easy to switch in other early stopping techniques.
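The "early stopping as a callback" idea can be sketched as follows. The Decision enum and EarlyStopping struct are hypothetical; the real version would plug into whatever callback mechanism #5 ends up with, and the rollback would go through LGBM_BoosterRollbackOneIter:

```rust
/// What a metrics callback can tell the training loop (illustrative).
enum Decision {
    Continue,
    /// Stop and roll back to the given best iteration.
    Stop { best_iteration: usize },
}

struct EarlyStopping {
    patience: usize,
    best_score: f64,
    best_iteration: usize,
    rounds_without_improvement: usize,
}

impl EarlyStopping {
    fn new(patience: usize) -> Self {
        EarlyStopping {
            patience,
            best_score: f64::INFINITY,
            best_iteration: 0,
            rounds_without_improvement: 0,
        }
    }

    /// Called once per iteration with the validation loss (lower is better).
    fn on_eval(&mut self, iteration: usize, valid_loss: f64) -> Decision {
        if valid_loss < self.best_score {
            self.best_score = valid_loss;
            self.best_iteration = iteration;
            self.rounds_without_improvement = 0;
        } else {
            self.rounds_without_improvement += 1;
            if self.rounds_without_improvement >= self.patience {
                return Decision::Stop { best_iteration: self.best_iteration };
            }
        }
        Decision::Continue
    }
}

fn main() {
    // Validation loss stops improving after iteration 2.
    let losses = [0.9, 0.7, 0.65, 0.66, 0.67, 0.68];
    let mut stopper = EarlyStopping::new(3);
    let mut stopped_at = None;
    for (i, &loss) in losses.iter().enumerate() {
        if let Decision::Stop { best_iteration } = stopper.on_eval(i, loss) {
            stopped_at = Some((i, best_iteration));
            break;
        }
    }
    // With patience 3 we stop at iteration 5 and roll back to iteration 2.
    assert_eq!(stopped_at, Some((5, 2)));
}
```

Because the decision lives entirely in the callback, swapping in a different stopping criterion needs no changes to the training loop itself.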

MacOS CI failing

In CI, Ubuntu builds seem to work while MacOS builds fail. Looks like a missing dependency?

Add Windows to CI

Currently CI only tests Linux and macOS. We should add Windows. This should be relatively easy: GitHub has a windows-latest runner, and that runner comes with vcpkg (and Chocolatey) preinstalled as package managers, so getting the necessary dependencies should be easy (we can also borrow the config from deepsign-client, adapting it to the style of CI used here).
