
pygbm's Introduction


Experimental Gradient Boosting Machines in Python.

The goal of this project is to evaluate whether it's possible to implement an efficient, histogram-binning version of Gradient Boosting Trees (possibly with all the LightGBM optimizations) while staying in pure Python 3.6+ using the numba JIT compiler.

pygbm provides a set of scikit-learn compatible estimator classes that should play well with the scikit-learn Pipeline and model selection tools (grid search and randomized hyperparameter search).
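
For instance, here is a minimal sketch of what such usage could look like (assuming the GradientBoostingClassifier estimator and its max_iter / max_leaf_nodes parameters, both of which appear in the issues further down this page):

# Hedged sketch: a pygbm estimator inside scikit-learn model selection tooling.
# GradientBoostingClassifier, max_iter and max_leaf_nodes are assumed from the
# issues below; adjust to the actual API if it differs.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from pygbm import GradientBoostingClassifier

X, y = make_classification(n_samples=1000, random_state=0)

pipe = Pipeline([('gbm', GradientBoostingClassifier(random_state=0))])
param_grid = {'gbm__max_iter': [50, 100],
              'gbm__max_leaf_nodes': [15, 31]}

search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)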

Longer term plans include integration with dask and dask-ml for out-of-core and distributed fitting on a cluster.

Installation

The project is available on PyPI and can be installed with pip:

pip install pygbm

You'll need Python 3.6 at least.

Documentation

The API documentation is available at:

https://pygbm.readthedocs.io/

You might also want to have a look at the examples/ folder of this repo.

Status

The project is experimental. The API is subject to change without deprecation notice. Use at your own risk.

We welcome any feedback in the github issue tracker:

https://github.com/ogrisel/pygbm/issues

Running the development version

Use pip to install in "editable" mode:

git clone https://github.com/ogrisel/pygbm.git
cd pygbm
pip install -r requirements.txt
pip install --editable .

Run the tests with pytest:

pip install -r requirements.txt
pytest

Benchmarking

The benchmarks folder contains some scripts to evaluate the computation performance of various parts of pygbm. Keep in mind that numba's JIT compilation takes time!

Profiling

To profile the benchmarks, you can use snakeviz to get an interactive HTML report:

pip install snakeviz
python -m cProfile -o bench_higgs_boson.prof benchmarks/bench_higgs_boson.py
snakeviz bench_higgs_boson.prof

Debugging numba type inference

To introspect the results of type inference steps in the numba sections called by a given benchmarking script:

numba --annotate-html bench_higgs_boson.html benchmarks/bench_higgs_boson.py

In particular it is interesting to check that the numerical variables in the hot loops highlighted by the snakeviz profiling report have the expected precision level (e.g. float32 for loss computation, uint8 for binned feature values, ...).
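
As an illustration, numba lets you pin the types of local variables, which can be handy when the annotated report shows an unexpected promotion to float64. This is a hedged sketch, not pygbm code; the function and variable names are made up:

# Sketch: forcing local variable precision in a numba kernel.
import numpy as np
from numba import njit, float32, uint8

@njit(locals={'acc': float32, 'bin_idx': uint8})
def sum_binned(binned_feature, bin_values):
    acc = np.float32(0.)
    for i in range(binned_feature.shape[0]):
        bin_idx = binned_feature[i]       # uint8 binned feature value
        acc += bin_values[bin_idx]        # accumulated in float32
    return acc

binned = np.array([0, 1, 1, 2], dtype=np.uint8)
values = np.arange(3, dtype=np.float32)
print(sum_binned(binned, values))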

Impact of thread-based parallelism

Some benchmarks can call numba functions that leverage the built-in thread-based parallelism with @njit(parallel=True) and prange loops. On a multicore machine you can evaluate how the thread-based parallelism scales by explicitly setting the NUMBA_NUM_THREADS environment variable. For instance try:

NUMBA_NUM_THREADS=1 python benchmarks/bench_binning.py

vs:

NUMBA_NUM_THREADS=4 python benchmarks/bench_binning.py

Acknowledgements

The work from Nicolas Hug is supported by the National Science Foundation under Grant No. 1740305 and by DARPA under Grant No. DARPA-BAA-16-51.

The work from Olivier Grisel is supported by the scikit-learn initiative and its partners at Inria Fondation.

pygbm's People

Contributors

amueller, laurae2, maartenbreddels, nicolashug, ogrisel


pygbm's Issues

Thread scalability is suboptimal

As reported in #30 (comment) , the scalability of pygbm is not as good as LightGBM.

Here are some results on a machine with the following CPUs:

Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz: 2 sockets with 12 cores each, which means 48 hyperthreads in total.

1 thread (sequential)

NUMBA_NUM_THREADS=1 OMP_NUM_THREADS=1 python benchmarks/bench_higgs_boson.py  --n-trees 100 --learning-rate 0.1 --n-leaf-nodes 255
Model      Time    AUC      Speed up
LightGBM   1045s   0.8282   1x
pygbm      1129s   0.8192   1x

8 threads

NUMBA_NUM_THREADS=8 OMP_NUM_THREADS=8 python benchmarks/bench_higgs_boson.py  --n-trees 100 --learning-rate 0.1 --n-leaf-nodes 255
Model      Time    AUC      Speed up
LightGBM   160s    0.8282   6.53x
pygbm      356s    0.8193   3.2x

48 (hyper)threads

python benchmarks/bench_higgs_boson.py  --n-trees 100 --learning-rate 0.1 --n-leaf-nodes 255
Model      Time    AUC      Speed up
LightGBM   91s     0.8282   11.5x
pygbm      130s    0.8193   8.7x

All of those pygbm runs used numba 0.40 from anaconda using the tbb backend (which is the fastest for pygbm).

Benchmark results with better parameters

Used a laptop for a better demo benchmark:

  • Intel Core i7-7700HQ (4 cores, 8 threads), unthrottled
  • 32GB RAM DDR4 2400 MHz (dual channel)
  • Python 3.6, scikit-learn 0.20, numba 0.40.1

Setup for the proper benchmarking:

  • No LightGBM / pygbm warmup allowed
  • 1 million training samples (10 million might crash on 64GB RAM? pygbm requires at least 24GB RAM for 1 million)
  • 500 training iterations
  • 255 leaves
  • 0.05 learning rate (can change to 0.10 actually for better comparison with independent benchmarks)

The benchmark in the master branch (https://github.com/ogrisel/pygbm/blob/master/benchmarks/bench_higgs_boson.py) is way too short and doesn't really test the speed of the whole model because it runs so fast: there are diminishing returns as the number of iterations increases, and this is what is difficult to optimize once the tree construction is already optimized.

Results:

Model      Time       AUC      Comments
LightGBM   45.260s    0.8293   Reference, runnable with 8GB RAM.
pygbm      359.101s   0.8180   Requires over 24GB RAM; gets slower as more trees are added over time.

Conclusion:

  • pygbm is 5 to 10 times slower, but being slower does not mean it is worse. It is actually very fast compared to xgboost's exact method from two years ago, and as of today it can be considered competitive in speed with xgboost's exact method if you have enough RAM
  • pygbm requires way too much RAM; you will only notice this when using many iterations because memory usage seems to increase linearly with the number of iterations

To run the benchmark, one can use the following for a clean setup. It is not optimized for the fastest performance, but it provides the prerequisites (scikit-learn 0.20, numba 0.39):

pip install lightgbm
pip install -U scikit-learn
pip install -U numba

git clone https://github.com/ogrisel/pygbm.git
cd pygbm

Before installing pygbm, change the following at line 147 of pygbm/grower.py (https://github.com/ogrisel/pygbm/blob/master/pygbm/grower.py#L146-L147):

            node.construction_speed = (node.sample_indices.shape[0] /
                                       node.find_split_time)

to:

            node.construction_speed = (node.sample_indices.shape[0] / 1.0)

This avoids the infamous divide-by-zero error.

Then, one can run the following:

pip install --editable .

Download the HIGGS dataset from https://archive.ics.uci.edu/ml/machine-learning-databases/00280/ and uncompress it (it is a large file, so this may take a while on a slow Internet connection).

Then, you may run a proper benchmark using the following (make sure to change the load_path to your HIGGS csv file):

import os
from time import time
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from pygbm import GradientBoostingMachine
from lightgbm import LGBMRegressor
import numba
import gc


n_leaf_nodes = 255
n_trees = 500
lr = 0.05
max_bins = 255
load_path = "mnt/HIGGS/HIGGS.csv"
subsample = 1000000 # Change this to 10000000 if you wish, or to None

df = pd.read_csv(load_path, header=None, dtype=np.float32)
target = df.values[:, 0]
data = np.ascontiguousarray(df.values[:, 1:])
data_train, data_test, target_train, target_test = train_test_split(
    data, target, test_size=50000, random_state=0)

if subsample is not None:
    data_train, target_train = data_train[:subsample], target_train[:subsample]

n_samples, n_features = data_train.shape
print(f"Training set with {n_samples} records with {n_features} features.")

# Includes warmup time penalty
print("Fitting a LightGBM model...")
tic = time()
lightgbm_model = LGBMRegressor(n_estimators=n_trees, num_leaves=n_leaf_nodes,
                               learning_rate=lr, silent=False)
lightgbm_model.fit(data_train, target_train)
toc = time()
predicted_test = lightgbm_model.predict(data_test)
roc_auc = roc_auc_score(target_test, predicted_test)
print(f"done in {toc - tic:.3f}s, ROC AUC: {roc_auc:.4f}")
del lightgbm_model
del predicted_test
gc.collect()

# Includes warmup time penalty
print("Fitting a pygbm model...")
tic = time()
pygbm_model = GradientBoostingMachine(learning_rate=lr, max_iter=n_trees,
                                      max_bins=max_bins,
                                      max_leaf_nodes=n_leaf_nodes,
                                      random_state=0, scoring=None,
                                      verbose=1, validation_split=None)
pygbm_model.fit(data_train, target_train)
toc = time()
predicted_test = pygbm_model.predict(data_test)
roc_auc = roc_auc_score(target_test, predicted_test)
print(f"done in {toc - tic:.3f}s, ROC AUC: {roc_auc:.4f}")
del pygbm_model
del predicted_test
gc.collect()


if hasattr(numba, 'threading_layer'):
    print("Threading layer chosen: %s" % numba.threading_layer())

If something is missing in the script, please let me know.

Remove empty slice check (numba fixed the issue)

numba/numba#3554 was fixed so we can remove our temporary fix from #51 once the next version is released.

Places to fix (ATM):

~/dev/pygbm » ag "array\[:0\] will"
pygbm/splitting.py
250:        # array[:0] will cause a bug in numba 0.41 so we need the if. Remove

pygbm/utils.py
80:        # array[:0] will cause a bug in numba 0.41 so we need the if.

Avoid ordered_gradients?

Again, still digesting the code, but maybe you thought about this. Can ordered_gradients be avoided? Can we reorder in place in the gradient array, and have the tree grower put it back in place at the end of .grow()? I see no place where the original order of the gradients matters except in gradient_boosting.py.
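
A purely illustrative NumPy sketch of the idea (not pygbm's actual data layout): permute the gradient array in place for the duration of grow(), then undo the permutation afterwards.

# Illustrative sketch only: reorder gradients in place, then restore the order.
import numpy as np

gradients = np.arange(10, dtype=np.float32)          # global gradient array
perm = np.random.RandomState(0).permutation(10)      # stand-in for the node ordering

gradients[:] = gradients[perm]                       # reorder in place for grow()

inv_perm = np.empty_like(perm)
inv_perm[perm] = np.arange(10)
gradients[:] = gradients[inv_perm]                   # restore original order at the end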

Enhance binning strategy

Results are comparable to LightGBM when n_samples <= n_bins because both libs are using the actual feature values as bin thresholds.

This is not the case anymore when n_samples > n_bins. In particular, on this very easy dataset (target = X[:, 0] > 0), lightgbm finds a perfect threshold of 1e-35 while that of pygbm is -0.262. This leads to different trees and less accurate predictions (1 vs .9).

from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.metrics import accuracy_score
import numpy as np
from pygbm import GradientBoostingMachine
from lightgbm import LGBMClassifier
from pygbm.plotting import plot_tree

rng = np.random.RandomState(seed=2)

n_leaf_nodes = 5
n_trees = 1
lr = 1.
min_samples_leaf = 1

max_bins = 5
n_samples = 100

X = rng.normal(size=(n_samples, 5))
y = (X[:, 0] > 0)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=rng)

pygbm_model = GradientBoostingMachine(
    loss='log_loss', learning_rate=lr, max_iter=n_trees, max_bins=max_bins,
    max_leaf_nodes=n_leaf_nodes, random_state=0, scoring=None, verbose=1,
    validation_split=None, min_samples_leaf=min_samples_leaf)
pygbm_model.fit(X_train, y_train)
predicted_test = pygbm_model.predict(X_test)
acc = accuracy_score(y_test, predicted_test)
print(acc)

lightgbm_model = LGBMClassifier(
    objective='binary', n_estimators=n_trees, max_bin=max_bins,
    num_leaves=n_leaf_nodes, learning_rate=lr, verbose=10, random_state=0,
    boost_from_average=False, min_data_in_leaf=min_samples_leaf)
lightgbm_model.fit(X_train, y_train)
predicted_test = lightgbm_model.predict(X_test)
acc = accuracy_score(y_test, predicted_test)
print(acc)

plot_tree(pygbm_model, lightgbm_model, view=True)


Implement support for missing values

In particular for binned numerical features, we need to reserve a specific bin as a marker for missing values. I believe this is what LightGBM does.
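
A hedged sketch of what reserving a bin for missing values could look like; the bin index, helper name and mapping logic below are assumptions, not pygbm's actual binning code:

# Illustrative sketch: reserve the last bin (e.g. 255) as the missing-value marker.
import numpy as np

MISSING_BIN = 255  # assumed marker, leaving bins 0..254 for actual thresholds

def map_to_bins_with_missing(values, bin_thresholds):
    # Map raw feature values to bins, sending NaNs to the reserved bin.
    binned = np.searchsorted(bin_thresholds, values).astype(np.uint8)
    binned[np.isnan(values)] = MISSING_BIN
    return binned

values = np.array([0.1, np.nan, 3.5], dtype=np.float32)
thresholds = np.array([0.0, 1.0, 2.0], dtype=np.float32)
print(map_to_bins_with_missing(values, thresholds))  # [1, 255, 3]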

Remove constant_hessian_value?

I think we can drop the constant_hessian_value in the SplittingContext, and always assume the constant hessians value is 1. We just have to set the gradients value accordingly, to have consistent leaf values.

This is what we do for LS loss already. WDYT @ogrisel ?

Parallelize predictions with static scheduling

Not sure if this will have a high impact on the timings, but in the prediction functions we could replace the parallel loop over all samples

for i in prange(numeric_data.shape[0]):

with static scheduling to take advantage of the cache.

This is probably only going to be significant if scoring is not None because in this case _predict_binned will be called at every iteration.
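
A minimal sketch of the kind of static scheduling being suggested, assuming a chunked prange over samples; this is illustrative only, not the actual predictor code:

# Illustrative sketch: statically chunk the sample loop so each prange
# iteration handles one contiguous block of samples instead of a single sample.
import numpy as np
from numba import njit, prange

@njit(parallel=True)
def predict_chunked(numeric_data, out, n_chunks):
    n_samples = numeric_data.shape[0]
    chunk_size = (n_samples + n_chunks - 1) // n_chunks
    for chunk in prange(n_chunks):
        start = chunk * chunk_size
        stop = min(start + chunk_size, n_samples)
        for i in range(start, stop):
            out[i] = numeric_data[i, 0]   # stand-in for the real tree traversal

data = np.random.rand(1000, 3)
out = np.empty(1000)
predict_chunked(data, out, 8)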

Allocate ordered_gradients and ordered_hessians only once

Opening this issue so that we don't forget there is room for improvement on this matter.

We could allocate ordered_gradients and ordered_hessians once and for all in the SplittingContext object instead of doing the allocation for every node.

Also, we could populate them in parallel.

I tried doing this in 1b073b1. It passes the tests but it still allocates the arrays for each node because I forgot to remove this line (><).

When I did remove it, and after having patched the places where ordered_gradients is used (e.g. changing sum_gradients = ordered_gradients.sum() into sum_gradients = ordered_gradients[:n_samples].sum()), the tests don't pass anymore. It could be a numba bug similar to numba/numba#3459, but I'm not certain I didn't miss something myself.
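
For reference, a plain-Python sketch of the preallocation idea (illustrative only; the attribute names follow this issue but the surrounding code is assumed):

# Illustrative sketch: allocate the ordered buffers once, then use views of
# the first n_samples entries for each node instead of reallocating.
import numpy as np

class SplittingContextSketch:
    def __init__(self, n_total_samples):
        self.ordered_gradients = np.empty(n_total_samples, dtype=np.float32)
        self.ordered_hessians = np.empty(n_total_samples, dtype=np.float32)

    def fill_for_node(self, gradients, hessians, sample_indices):
        n_samples = sample_indices.shape[0]
        self.ordered_gradients[:n_samples] = gradients[sample_indices]
        self.ordered_hessians[:n_samples] = hessians[sample_indices]
        # downstream code must then use the [:n_samples] views, e.g.
        # sum_gradients = self.ordered_gradients[:n_samples].sum()
        return n_samples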

min_samples_leaf does not do what it's supposed to do

min_samples_leaf=20 should ensure that we never do a split that would result in less than 20 samples in each of the two resulting leaves.

Our current implementation does not split nodes with fewer than 20 samples, which is not the same thing: it is akin to the min_samples_split parameter of scikit-learn trees, which is not a good hyperparameter to control over-fitting.

I am working on a fix.
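
To make the distinction concrete, here is a sketch of the intended check (assumed logic, not the actual fix):

# Illustrative: min_samples_leaf must be checked on both children of a candidate
# split, not on the node being split (which is what a min_samples_split-style
# check does).
def split_is_allowed(n_samples_left, n_samples_right, min_samples_leaf=20):
    return (n_samples_left >= min_samples_leaf
            and n_samples_right >= min_samples_leaf)

print(split_is_allowed(25, 10))  # False: the right child would be too small
print(split_is_allowed(25, 30))  # True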

numba-integration-test failure

The numba-integration-tests failed and I am trying to figure out if it is a legit failure or if we introduced a new bug in Numba.

The failure is

=================================== FAILURES ===================================
_________________ test_derivatives[binary_crossentropy-0.3-0] __________________

loss = <pygbm.loss.BinaryCrossEntropy object at 0x7f4f2853da50>
x0 = array([[0.3]], dtype=float32), y_true = array([0.], dtype=float32)

    @pytest.mark.parametrize('loss, x0, y_true', [
        ('least_squares', -2., 42),
        ('least_squares', 117., 1.05),
        ('least_squares', 0., 0.),
        ('binary_crossentropy', 0.3, 0),
        ('binary_crossentropy', -12, 1),
        ('binary_crossentropy', 30, 1),
    ])
    def test_derivatives(loss, x0, y_true):
        # Check that gradients are zero when the loss is minimized on 1D array
        # using the Newton-Raphson and the first and second order derivatives
        # computed by the Loss instance.
    
        loss = _LOSSES[loss]()
        y_true = np.array([y_true], dtype=np.float32)
        x0 = np.array([x0], dtype=np.float32).reshape(1, 1)
        get_gradients, get_hessians = get_derivatives_helper(loss)
    
        def func(x):
            return loss(y_true, x)
    
        def fprime(x):
            return get_gradients(y_true, x)
    
        def fprime2(x):
            return get_hessians(y_true, x)
    
>       optimum = newton(func, x0=x0, fprime=fprime, fprime2=fprime2)

tests/test_loss.py:78: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

func = <function test_derivatives.<locals>.func at 0x7f4f0c4e6440>
x0 = array([[0.3]], dtype=float32)
fprime = <function test_derivatives.<locals>.fprime at 0x7f4f0c4e67a0>
args = (), tol = 1.48e-08, maxiter = 50
fprime2 = <function test_derivatives.<locals>.fprime2 at 0x7f4f0c4e65f0>
x1 = None, rtol = 0.0, full_output = False, disp = True

    def newton(func, x0, fprime=None, args=(), tol=1.48e-8, maxiter=50,
               fprime2=None, x1=None, rtol=0.0,
               full_output=False, disp=True):
        """
        Find a zero of a real or complex function using the Newton-Raphson
        (or secant or Halley's) method.
    
        Find a zero of the function `func` given a nearby starting point `x0`.
        The Newton-Raphson method is used if the derivative `fprime` of `func`
        is provided, otherwise the secant method is used.  If the second order
        derivative `fprime2` of `func` is also provided, then Halley's method is
        used.
    
        If `x0` is a sequence with more than one item, then `newton` returns an
        array, and `func` must be vectorized and return a sequence or array of the
        same shape as its first argument. If `fprime` or `fprime2` is given then
        its return must also have the same shape.
    
        Parameters
        ----------
        func : callable
            The function whose zero is wanted. It must be a function of a
            single variable of the form ``f(x,a,b,c...)``, where ``a,b,c...``
            are extra arguments that can be passed in the `args` parameter.
        x0 : float, sequence, or ndarray
            An initial estimate of the zero that should be somewhere near the
            actual zero. If not scalar, then `func` must be vectorized and return
            a sequence or array of the same shape as its first argument.
        fprime : callable, optional
            The derivative of the function when available and convenient. If it
            is None (default), then the secant method is used.
        args : tuple, optional
            Extra arguments to be used in the function call.
        tol : float, optional
            The allowable error of the zero value.  If `func` is complex-valued,
            a larger `tol` is recommended as both the real and imaginary parts
            of `x` contribute to ``|x - x0|``.
        maxiter : int, optional
            Maximum number of iterations.
        fprime2 : callable, optional
            The second order derivative of the function when available and
            convenient. If it is None (default), then the normal Newton-Raphson
            or the secant method is used. If it is not None, then Halley's method
            is used.
        x1 : float, optional
            Another estimate of the zero that should be somewhere near the
            actual zero.  Used if `fprime` is not provided.
        rtol : float, optional
            Tolerance (relative) for termination.
        full_output : bool, optional
            If `full_output` is False (default), the root is returned.
            If True and `x0` is scalar, the return value is ``(x, r)``, where ``x``
            is the root and ``r`` is a `RootResults` object.
            If True and `x0` is non-scalar, the return value is ``(x, converged,
            zero_der)`` (see Returns section for details).
        disp : bool, optional
            If True, raise a RuntimeError if the algorithm didn't converge, with
            the error message containing the number of iterations and current
            function value.  Otherwise the convergence status is recorded in a
            `RootResults` return object.
            Ignored if `x0` is not scalar.
            *Note: this has little to do with displaying, however
            the `disp` keyword cannot be renamed for backwards compatibility.*
    
        Returns
        -------
        root : float, sequence, or ndarray
            Estimated location where function is zero.
        r : `RootResults`, optional
            Present if ``full_output=True`` and `x0` is scalar.
            Object containing information about the convergence.  In particular,
            ``r.converged`` is True if the routine converged.
        converged : ndarray of bool, optional
            Present if ``full_output=True`` and `x0` is non-scalar.
            For vector functions, indicates which elements converged successfully.
        zero_der : ndarray of bool, optional
            Present if ``full_output=True`` and `x0` is non-scalar.
            For vector functions, indicates which elements had a zero derivative.
    
        See Also
        --------
        brentq, brenth, ridder, bisect
        fsolve : find zeros in n dimensions.
    
        Notes
        -----
        The convergence rate of the Newton-Raphson method is quadratic,
        the Halley method is cubic, and the secant method is
        sub-quadratic.  This means that if the function is well behaved
        the actual error in the estimated zero after the n-th iteration
        is approximately the square (cube for Halley) of the error
        after the (n-1)-th step.  However, the stopping criterion used
        here is the step size and there is no guarantee that a zero
        has been found. Consequently the result should be verified.
        Safer algorithms are brentq, brenth, ridder, and bisect,
        but they all require that the root first be bracketed in an
        interval where the function changes sign. The brentq algorithm
        is recommended for general use in one dimensional problems
        when such an interval has been found.
    
        When `newton` is used with arrays, it is best suited for the following
        types of problems:
    
        * The initial guesses, `x0`, are all relatively the same distance from
          the roots.
        * Some or all of the extra arguments, `args`, are also arrays so that a
          class of similar problems can be solved together.
        * The size of the initial guesses, `x0`, is larger than O(100) elements.
          Otherwise, a naive loop may perform as well or better than a vector.
    
        Examples
        --------
        >>> from scipy import optimize
        >>> import matplotlib.pyplot as plt
    
        >>> def f(x):
        ...     return (x**3 - 1)  # only one real root at x = 1
    
        ``fprime`` is not provided, use the secant method:
    
        >>> root = optimize.newton(f, 1.5)
        >>> root
        1.0000000000000016
        >>> root = optimize.newton(f, 1.5, fprime2=lambda x: 6 * x)
        >>> root
        1.0000000000000016
    
        Only ``fprime`` is provided, use the Newton-Raphson method:
    
        >>> root = optimize.newton(f, 1.5, fprime=lambda x: 3 * x**2)
        >>> root
        1.0
    
        Both ``fprime2`` and ``fprime`` are provided, use Halley's method:
    
        >>> root = optimize.newton(f, 1.5, fprime=lambda x: 3 * x**2,
        ...                        fprime2=lambda x: 6 * x)
        >>> root
        1.0
    
        When we want to find zeros for a set of related starting values and/or
        function parameters, we can provide both of those as an array of inputs:
    
        >>> f = lambda x, a: x**3 - a
        >>> fder = lambda x, a: 3 * x**2
        >>> np.random.seed(4321)
        >>> x = np.random.randn(100)
        >>> a = np.arange(-50, 50)
        >>> vec_res = optimize.newton(f, x, fprime=fder, args=(a, ))
    
        The above is the equivalent of solving for each value in ``(x, a)``
        separately in a for-loop, just faster:
    
        >>> loop_res = [optimize.newton(f, x0, fprime=fder, args=(a0,))
        ...             for x0, a0 in zip(x, a)]
        >>> np.allclose(vec_res, loop_res)
        True
    
        Plot the results found for all values of ``a``:
    
        >>> analytical_result = np.sign(a) * np.abs(a)**(1/3)
        >>> fig = plt.figure()
        >>> ax = fig.add_subplot(111)
        >>> ax.plot(a, analytical_result, 'o')
        >>> ax.plot(a, vec_res, '.')
        >>> ax.set_xlabel('$a$')
        >>> ax.set_ylabel('$x$ where $f(x, a)=0$')
        >>> plt.show()
    
        """
        if tol <= 0:
            raise ValueError("tol too small (%g <= 0)" % tol)
        if maxiter < 1:
            raise ValueError("maxiter must be greater than 0")
        if np.size(x0) > 1:
            return _array_newton(func, x0, fprime, args, tol, maxiter, fprime2,
                                 full_output)
    
        # Convert to float (don't use float(x0); this works also for complex x0)
        p0 = 1.0 * x0
        funcalls = 0
        if fprime is not None:
            # Newton-Raphson method
            for itr in range(maxiter):
                # first evaluate fval
                fval = func(p0, *args)
                funcalls += 1
                # If fval is 0, a root has been found, then terminate
                if fval == 0:
                    return _results_select(
                        full_output, (p0, funcalls, itr, _ECONVERGED))
                fder = fprime(p0, *args)
                funcalls += 1
                if fder == 0:
                    msg = "Derivative was zero."
                    if disp:
                        msg += (
                            " Failed to converge after %d iterations, value is %s."
                            % (itr + 1, p0))
>                       raise RuntimeError(msg)
E                       RuntimeError: Derivative was zero. Failed to converge after 46 iterations, value is [[-89.88232]].

../miniconda3/envs/pygbm/lib/python3.7/site-packages/scipy/optimize/zeros.py:294: RuntimeError

And the full log is:

https://circleci.com/gh/numba/numba-integration-testing/1120

scikit-learn 0.20 is required

Please add to your dependencies that scikit-learn 0.20 is required.

Otherwise, it throws the following error:

ImportError: cannot import name 'check_scoring'

Which is only available from scikit-learn 0.20 (latest stable version).

Optionally use left/right indices buffer

continuing from #79

In splitting.py, the left/right_indices_buffer will use up 8GB for 10^9 rows. If that causes swapping, the performance benefit of multithreading (which requires these buffers) is most probably not worth it. Would it be an option to disable this?
Is there also a method that could work without the buffer? (I'm still wrapping my head around the algo, maybe you already thought about it.)
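
For reference, the 8 GB figure is consistent with two uint32 index buffers: 2 buffers × 10^9 rows × 4 bytes per index = 8 GB (assuming the sample indices are stored as uint32).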

Implement support for sparse feature data

For instance if all the data is passed as a scipy.sparse.csc_matrix (e.g. after one hot encoding).

Pandas has support for sparse features: http://pandas.pydata.org/pandas-docs/stable/sparse.html

In particular it has a dedicated data structure for 1D sparse data: SparseArray.

There is also https://github.com/pydata/sparse and I believe the ecosystem will converge at some point. I would be in favor of leveraging the data structures from Pandas to start with the most adopted solution, one that allows for heterogeneously typed features (a mix of dense and sparse columns, categorical or numerical).

sum_gradient and sum_hessians computation in find_node_split_subtraction

Instead of summing over all the bins of an arbitrary feature histogram to compute
context.sum_gradients and context.sum_hessians, we can instead directly pass those values to find_node_split_subtraction since the parent's split_info already contains gradient_left/right and hessian_left/right.

We're summing over 255 bins max so the gain is very minimal, but it would be clearer, and more accurate.
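
A sketch of the proposed shortcut (illustrative names; the real signatures may differ):

# Illustrative: instead of re-summing a histogram's bins to get the node's
# totals, reuse the sums already stored in the parent's split_info.
def node_sums_from_parent(parent_split_info, is_left_child):
    if is_left_child:
        return parent_split_info.gradient_left, parent_split_info.hessian_left
    return parent_split_info.gradient_right, parent_split_info.hessian_right

# find_node_split_subtraction(..., sum_gradients, sum_hessians) could then take
# these values directly instead of computing them from an arbitrary histogram.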

Usage of scorer is incorrect

We use scorer._score_func(y_train, predicted_train) where predicted_train = self._predict_binned(X_binned_train).

predict_binned outputs classes for classification. This won't work for e.g. scoring='neg_log_loss' which requires probabilities from predict_proba().

One solution is to get rid of scoring and check early stopping based on the loss.

If we want to keep scoring, I think we'll need to always use scorer.__call__() instead of scorer._score_func(), so that we don't reinvent the wheel.

The signature of scorer.__call__() is estimator, X, y, so we'll need to pass self. I think this is doable. I see 2 options:

  • We pass X_binned_train to scorer.__call__() and predict() (and predict_proba) has to figure out whether its input X is pre-binned or not, e.g. by checking that the dtype is uint8. This forces users to never use uint8 arrays.
  • We pass X_train instead of X_binned_train. This is slightly slower.

In both cases it's better to have a C-contiguous X for predictions. It's faster than with Fortran arrays. Here are 3 runs of bench_predictors.py from #62 (5e6 samples)

               Run 1     Run 2     Run 3
Pre-binned F   0.5456s   0.4914s   0.5019s
Pre-binned C   0.5287s   0.4690s   0.4724s
numerical F    0.7616s   0.6745s   0.6788s
numerical C    0.5716s   0.5033s   0.5099s

Status of this project?

Hi,
Are people still working on this project? Is this going anywhere? If not, what were the issues?
Just wondering, saw that the latest commit is > 6 months old, but it's such a cool project.

Thanks,
Jonathan

Unify names of X_binned, features_data and binned_features

X_binned in GradientBoosting is named features_data in the grower and binned_features in the splitting context.

How about we use instead:

  • X_binned in every class
  • or X_binned in GradientBoosting and binned_features in both the grower and splitting context

?

As a side note, passing n_features to the splitting context is redundant since it's just binned_features.shape[0].

Recent Numba not usable with pygbm

I am seeing the following:

::>> running: 'conda run -n pygbm pytest'
ERROR conda.cli.main_run:execute(33): Subprocess for 'conda run ['pytest']' command failed.  (See above for error)
============================= test session starts ==============================
platform linux -- Python 3.8.5, pytest-6.1.1, py-1.9.0, pluggy-0.13.1
rootdir: /home/circleci/repo/pygbm
collected 0 items / 9 errors

==================================== ERRORS ====================================
____________________ ERROR collecting tests/test_binning.py ____________________
ImportError while importing test module '/home/circleci/repo/pygbm/tests/test_binning.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
../miniconda3/envs/pygbm/lib/python3.8/importlib/__init__.py:127: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
tests/test_binning.py:5: in <module>
    from pygbm.binning import BinMapper, _find_binning_thresholds, _map_to_bins
pygbm/__init__.py:1: in <module>
    from pygbm.gradient_boosting import GradientBoostingClassifier
pygbm/gradient_boosting.py:18: in <module>
    from pygbm.grower import TreeGrower
pygbm/grower.py:11: in <module>
    from .splitting import (SplittingContext, split_indices, find_node_split,
pygbm/splitting.py:9: in <module>
    from numba import njit, jitclass, prange, float32, uint8, uint32
E   ImportError: cannot import name 'jitclass' from 'numba' (/home/circleci/repo/miniconda3/envs/pygbm/lib/python3.8/site-packages/numba/__init__.py)
_______________ ERROR collecting tests/test_compare_lightgbm.py ________________
ImportError while importing test module '/home/circleci/repo/pygbm/tests/test_compare_lightgbm.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
../miniconda3/envs/pygbm/lib/python3.8/importlib/__init__.py:127: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
tests/test_compare_lightgbm.py:7: in <module>
    from pygbm import GradientBoostingRegressor, GradientBoostingClassifier
pygbm/__init__.py:1: in <module>
    from pygbm.gradient_boosting import GradientBoostingClassifier
pygbm/gradient_boosting.py:18: in <module>
    from pygbm.grower import TreeGrower
pygbm/grower.py:11: in <module>
    from .splitting import (SplittingContext, split_indices, find_node_split,
pygbm/splitting.py:9: in <module>
    from numba import njit, jitclass, prange, float32, uint8, uint32
E   ImportError: cannot import name 'jitclass' from 'numba' (/home/circleci/repo/miniconda3/envs/pygbm/lib/python3.8/site-packages/numba/__init__.py)
_______________ ERROR collecting tests/test_gradient_boosting.py _______________
ImportError while importing test module '/home/circleci/repo/pygbm/tests/test_gradient_boosting.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
../miniconda3/envs/pygbm/lib/python3.8/importlib/__init__.py:127: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
tests/test_gradient_boosting.py:8: in <module>
    from pygbm import GradientBoostingClassifier
pygbm/__init__.py:1: in <module>
    from pygbm.gradient_boosting import GradientBoostingClassifier
pygbm/gradient_boosting.py:18: in <module>
    from pygbm.grower import TreeGrower
pygbm/grower.py:11: in <module>
    from .splitting import (SplittingContext, split_indices, find_node_split,
pygbm/splitting.py:9: in <module>
    from numba import njit, jitclass, prange, float32, uint8, uint32
E   ImportError: cannot import name 'jitclass' from 'numba' (/home/circleci/repo/miniconda3/envs/pygbm/lib/python3.8/site-packages/numba/__init__.py)
____________________ ERROR collecting tests/test_grower.py _____________________
ImportError while importing test module '/home/circleci/repo/pygbm/tests/test_grower.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
../miniconda3/envs/pygbm/lib/python3.8/importlib/__init__.py:127: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
tests/test_grower.py:7: in <module>
    from pygbm.grower import TreeGrower
pygbm/__init__.py:1: in <module>
    from pygbm.gradient_boosting import GradientBoostingClassifier
pygbm/gradient_boosting.py:18: in <module>
    from pygbm.grower import TreeGrower
pygbm/grower.py:11: in <module>
    from .splitting import (SplittingContext, split_indices, find_node_split,
pygbm/splitting.py:9: in <module>
    from numba import njit, jitclass, prange, float32, uint8, uint32
E   ImportError: cannot import name 'jitclass' from 'numba' (/home/circleci/repo/miniconda3/envs/pygbm/lib/python3.8/site-packages/numba/__init__.py)
___________________ ERROR collecting tests/test_histogram.py ___________________
ImportError while importing test module '/home/circleci/repo/pygbm/tests/test_histogram.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
../miniconda3/envs/pygbm/lib/python3.8/importlib/__init__.py:127: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
tests/test_histogram.py:5: in <module>
    from pygbm.histogram import _build_histogram_naive
pygbm/__init__.py:1: in <module>
    from pygbm.gradient_boosting import GradientBoostingClassifier
pygbm/gradient_boosting.py:18: in <module>
    from pygbm.grower import TreeGrower
pygbm/grower.py:11: in <module>
    from .splitting import (SplittingContext, split_indices, find_node_split,
pygbm/splitting.py:9: in <module>
    from numba import njit, jitclass, prange, float32, uint8, uint32
E   ImportError: cannot import name 'jitclass' from 'numba' (/home/circleci/repo/miniconda3/envs/pygbm/lib/python3.8/site-packages/numba/__init__.py)
_____________________ ERROR collecting tests/test_loss.py ______________________
ImportError while importing test module '/home/circleci/repo/pygbm/tests/test_loss.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
../miniconda3/envs/pygbm/lib/python3.8/importlib/__init__.py:127: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
tests/test_loss.py:8: in <module>
    from pygbm.loss import _LOSSES
pygbm/__init__.py:1: in <module>
    from pygbm.gradient_boosting import GradientBoostingClassifier
pygbm/gradient_boosting.py:18: in <module>
    from pygbm.grower import TreeGrower
pygbm/grower.py:11: in <module>
    from .splitting import (SplittingContext, split_indices, find_node_split,
pygbm/splitting.py:9: in <module>
    from numba import njit, jitclass, prange, float32, uint8, uint32
E   ImportError: cannot import name 'jitclass' from 'numba' (/home/circleci/repo/miniconda3/envs/pygbm/lib/python3.8/site-packages/numba/__init__.py)
___________________ ERROR collecting tests/test_plotting.py ____________________
ImportError while importing test module '/home/circleci/repo/pygbm/tests/test_plotting.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
../miniconda3/envs/pygbm/lib/python3.8/importlib/__init__.py:127: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
tests/test_plotting.py:5: in <module>
    from pygbm.grower import TreeGrower
pygbm/__init__.py:1: in <module>
    from pygbm.gradient_boosting import GradientBoostingClassifier
pygbm/gradient_boosting.py:18: in <module>
    from pygbm.grower import TreeGrower
pygbm/grower.py:11: in <module>
    from .splitting import (SplittingContext, split_indices, find_node_split,
pygbm/splitting.py:9: in <module>
    from numba import njit, jitclass, prange, float32, uint8, uint32
E   ImportError: cannot import name 'jitclass' from 'numba' (/home/circleci/repo/miniconda3/envs/pygbm/lib/python3.8/site-packages/numba/__init__.py)
___________________ ERROR collecting tests/test_predictor.py ___________________
ImportError while importing test module '/home/circleci/repo/pygbm/tests/test_predictor.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
../miniconda3/envs/pygbm/lib/python3.8/importlib/__init__.py:127: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
tests/test_predictor.py:9: in <module>
    from pygbm.grower import TreeGrower
pygbm/__init__.py:1: in <module>
    from pygbm.gradient_boosting import GradientBoostingClassifier
pygbm/gradient_boosting.py:18: in <module>
    from pygbm.grower import TreeGrower
pygbm/grower.py:11: in <module>
    from .splitting import (SplittingContext, split_indices, find_node_split,
pygbm/splitting.py:9: in <module>
    from numba import njit, jitclass, prange, float32, uint8, uint32
E   ImportError: cannot import name 'jitclass' from 'numba' (/home/circleci/repo/miniconda3/envs/pygbm/lib/python3.8/site-packages/numba/__init__.py)
___________________ ERROR collecting tests/test_splitting.py ___________________
ImportError while importing test module '/home/circleci/repo/pygbm/tests/test_splitting.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
../miniconda3/envs/pygbm/lib/python3.8/importlib/__init__.py:127: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
tests/test_splitting.py:6: in <module>
    from pygbm.splitting import _find_histogram_split
pygbm/__init__.py:1: in <module>
    from pygbm.gradient_boosting import GradientBoostingClassifier
pygbm/gradient_boosting.py:18: in <module>
    from pygbm.grower import TreeGrower
pygbm/grower.py:11: in <module>
    from .splitting import (SplittingContext, split_indices, find_node_split,
pygbm/splitting.py:9: in <module>
    from numba import njit, jitclass, prange, float32, uint8, uint32
E   ImportError: cannot import name 'jitclass' from 'numba' (/home/circleci/repo/miniconda3/envs/pygbm/lib/python3.8/site-packages/numba/__init__.py)

The fix is easy: jitclass must now be imported as from numba.experimental import jitclass. I can submit a patch if need be.
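
For instance, a backwards-compatible import could look like this (a sketch; the project may prefer to simply require a newer numba):

# Compatibility shim: newer numba releases expose jitclass under
# numba.experimental, older releases export it at the top level.
try:
    from numba.experimental import jitclass
except ImportError:
    from numba import jitclass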

Implement ONNX export

https://github.com/onnx/onnxmltools has exports for lightgbm and scikit-learn decision trees.

It would be great to add built-in ONNX export to pygbm, for instance in a new pygbm.onnx module (that is not imported by default, so as to make the dependency on the onnx package optional).

This would also make it possible to improve the checks in test_compare_lightgbm so as to check that the structure of the generated trees is the same.

Optimize score loss computation

Slightly related to #76

This is the second bullet point from #69 (comment)

When early stopping (or just score monitoring) is done on the training data with the loss, we should just use the raw_predictions array from fit() instead of re-computing it.

Results would be slightly different from the current implementation because we are currently computing the loss on a subset of the training data, not on the whole training data.

A further optimization would be, instead of calling loss_.__call__(), to compute the loss w.r.t each sample in e.g. loss_.update_gradients_and_hessians and use those values to compute the gradients and hessians. Overhead would be minimal this way.

Implement ONNX import

Once #23 (ONNX export) is implemented, adding ONNX imports would be nice to help with interoperability testing.

Parallel splitting fails in nopython mode

When running the tests and benchmarks I get the following error:

File "pygbm/splitting.py", line 253:

def split_indices(context, split_info, sample_indices):
    <source elided>
        sizes[:n_samples % n_threads] += 1
    offset_in_buffers = np.zeros(n_threads, dtype=np.int32)
    ^

but it can be solved by using parallel=False in the split_indices decorator

@njit(parallel=False,
      locals={'sample_idx': uint32,
              'left_count': uint32,
              'right_count': uint32})
def split_indices(context, split_info, sample_indices):

but it's certainly shameful to remove parallelization here :(

I'm using numba 0.51.2. Any ideas on how to solve it ?

Implement support for categorical variables

pygbm should accept pandas.DataFrame instances as X in gbm.fit(X, y) (and in predict), in addition to numpy arrays.

This would make it more natural to accept heterogeneously typed features in the training data.

I am not sure what the best strategy is to deal with categorical variables. Reviewing the strategy implemented in LightGBM and XGBoost would be helpful.

Use the training (or validation) loss value for early stopping by default

I think we should have scoring=None for all Gradient Boosting estimators by default and still enable early stopping.

  • if self.validation_split is not None, call self.loss_(y_validation, raw_predictions_validation) instead of a Scorer instance.
  • if self.validation_split is None, find a way to incrementally update an estimator of the full training loss without adding any additional computation cost (I think it should be doable but I have not checked in detail).
  • if n_iter_no_change is None or n_iter_no_change == 0 disable early stopping and go to max_iter.

@NicolasHug WDYT?

Passing pre-binned data to `fit()`

I think that it is not possible to pass pre-binned data into fit(), or at least that results may not be as expected.

This is the case both in the current implementation, and also with the changes in #65.

Let's say we pre-bin X into X_binned:

X_binned = initial_mapper.fit_transform(X)

and then call fit(X_binned).

One might expect predict(X) to output the same as predict_binned(X_binned), but that's not the case.

The reason is that in fit() the predictor is built with

predictor = grower.make_predictor(bin_thresholds=self.bin_mapper_.bin_thresholds_)

and despite the idempotence of the mapper, self.bin_mapper_.bin_thresholds_ is completely different from initial_mapper.bin_thresholds_.

  • Do we want to raise an error if a uint8 array is passed to fit to avoid confusion?
  • Do we want to allow passing pre-binned data and bypass the mapping in fit()? That would require passing bin_thresholds to fit() I think.

I'd go for the first option.


I'm also thinking about adding this test to clarify the behavior. It passes on #65 branch:

    X, y = make_regression()
    gbdt = GradientBoostingRegressor(scoring=None, random_state=0)
    mapper = BinMapper(random_state=0)
    X_binned = mapper.fit_transform(X)


    with pytest.raises(AssertionError):
        gbdt.fit(X_binned, y)
        np.testing.assert_array_almost_equal(gbdt.predict(X),
                                             gbdt.predict(X_binned))

    pred_X = gbdt.fit(X, y).predict(X)
    pred_X_binned = gbdt.fit(X_binned, y).predict(X_binned)
    np.testing.assert_array_almost_equal(pred_X, pred_X_binned)
