
ngboost's People

Contributors

alejandroschuler, angeldroth, avati, btatkinson, cdagnino, comprhys, connortann, cs2716, daikikatsuragawa, dependabot[bot], dguzelkokar, dingdaisy, eco3, jack-mcivor, joseortiz3, matsuken92, mlko53, mzjp2, peterdudfield, rhjohnstone, ryan-wolbeck, saianil58, samshipengs, themrzmaster, tokoroten, tonyduan, wakame1367, yutayamazaki, zhiruiwang, zyxue

ngboost's Issues

Discrete Burr XII and Skew-T distributions

Moving discussion here from #60 (comment)

@cooberp @kmedved here is some scaffold code that I'm hoping you folks can fill in/modify to get your distributions up and running:

from ngboost.distns import Distn
import numpy as np

class Generalized3ParamBurrXIIDiscrete(Distn):

    n_params = 3 

    def __init__(self, params):
        self.params_ = params # all real numbers 
        self.mu, self.sigma, self.nu = [np.exp(p) for p in params] # transform to positives

    def sample(self, m):
        """
        Code to draw m random samples from the distribution. Each sample will have 
        n = len(self) = len(self.mu) elements, so the output should be m x n        
        """
        return None

    def fit(Y):
        """
        Code to fit a *single* Discrete Burr XII distribution to data Y. 
        Needs to return parameters mu, sigma, and nu in a numpy array.

        This is just for initialization of the NGBoost algorithm, so it doesn't need to be perfect. 
        Ballpark is good enough.
        """
        mu, sigma, nu = None, None, None
        return np.array([mu, sigma, nu])

    # log score methods
    def nll(self, Y): 
        """
        log-likelihood (per observation). 
        Returns a vector of length n = len(self)
        """
        return -np.log(((1 + (Y/self.mu)**self.sigma)**(-self.nu)) - ((1 + ((Y+1)/self.mu)**self.sigma)**(-self.nu)))

    def D_nll(self, Y):
        """
        Returns the derivative of self.nll() with respect to each of the real-valued 
        parameters [log(mu), log(sigma), log(nu)]. 

        These can be easily calculated using, e.g., wolframalpha, and efficiently implemented here.
        """
        d_log_mu = np.zeros_like(self.mu)
        d_log_sigma = np.zeros_like(self.sigma)
        d_log_nu = np.zeros_like(self.nu)
        return np.array([d_log_mu, d_log_sigma, d_log_nu])

This is for the Discrete Burr XII, but equivalent code should also serve to implement your skew-t distribution.

D_nll() should be easy. Just copy-paste the nll into wolframalpha, edit the variable names, and ask for derivatives. Copy-paste back to D_nll, re-edit the variable names, and call it a day.

The biggest challenge, I think, will be implementing the sample() method, which is necessary if you don't want to derive/implement the Fisher Information. I was working on this myself but didn't have luck with anything simple. As you know, the distribution isn't already implemented in scipy.stats or another python package. scipy.stats does have a Burr XII, which I hoped to sample from and then use np.floor() on to get the discrete version, but then I noticed that what they call Burr XII has a different number of parameters than the pmf you gave me, which I think corresponds to some 3-parameter "generalized" version of the (discretized) Burr XII... All of which is fine, if that's what you want, but the upshot is that there isn't a pre-implemented version to sample from or an easy way to make one.

On the other hand, I don't think this is at all an insurmountable challenge. Making a sampling algorithm for a "custom" distribution is fairly straightforward using, e.g., inverse transform sampling. All you need to do is calculate the inverse CDF (use wolframalpha or whatever) and implement that. And if that doesn't work, there are other methods. All in all, still probably easier than deriving the Fisher.
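
To make that concrete, here is a minimal sketch of an inverse-transform sample() for the pmf assumed above (invert the continuous CDF and floor); double-check the algebra against your actual parameterization:

    def sample(self, m):
        # inverse transform sampling: invert the continuous CDF
        # F(y) = 1 - (1 + (y/mu)**sigma)**(-nu), i.e.
        # F_inv(u) = mu * ((1 - u)**(-1/nu) - 1)**(1/sigma),
        # then floor so that P(Y = y) = S(y) - S(y + 1), matching nll() above
        n = len(self.mu)
        u = np.random.uniform(size=(m, n))
        continuous = self.mu * ((1.0 - u) ** (-1.0 / self.nu) - 1.0) ** (1.0 / self.sigma)
        return np.floor(continuous)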

The fit() method is also not necessarily trivial, but feel free to use whatever heuristics you want since it's just for initialization. Or go wild and implement/call some optimization method of your choosing.
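
For example, one such heuristic (only a sketch; it assumes fit() should return the real-valued log-scale parameters that __init__ consumes; exponentiate the result if the raw mu, sigma, nu are wanted instead):

    from scipy.optimize import minimize

    def fit(Y):
        # crude initialization: minimize the marginal NLL of the pmf above
        # over (log mu, log sigma, log nu) from an arbitrary starting point
        def neg_loglik(log_params):
            mu, sigma, nu = np.exp(log_params)

            def surv(y):
                return (1 + (y / mu) ** sigma) ** (-nu)

            pmf = np.clip(surv(Y) - surv(Y + 1), 1e-12, None)
            return -np.sum(np.log(pmf))

        return minimize(neg_loglik, x0=np.zeros(3), method="Nelder-Mead").x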

I haven't looked into skew-T that closely, but the same general ideas should apply. And if the proper sample() method is already implemented somewhere, your job will be easier.

Since this is all a bit of a challenge, and these aren't distributions others are likely to use, I'm hoping you two (and/or your collaborators) can give me a hand here and give this a shot. But please do let me know where/if you get stuck and I will jump in to rescue as necessary!

NGBClassifier predict error

Running your code from https://github.com/stanfordmlgroup/ngboost/blob/master/examples/classification.py
I received the following error when I try

ngb.predict(X_test)

---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
in ()
----> 1 ngb.predict(X_test)

/usr/local/lib/python3.6/dist-packages/ngboost/ngboost.py in predict(self, X)
133 def predict(self, X):
134 dist = self.pred_dist(X)
--> 135 return list(dist.loc.flatten())
136
137 def score(self, X, Y):

AttributeError: 'NoneType' object has no attribute 'flatten'
----------------------------------------------------------------

ngb.pred_dist(X_test) works properly, but the same error also occurs when using sklearn's cross-validation function cross_validate().

Regards,

indentation is broken in api.py

I believe the indentation is broken as of the last commit.

>>> from ngboost import NGBRegressor
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/omarwagih/miniconda2/envs/pred_dpsi/lib/python3.8/site-packages/ngboost/__init__.py", line 2, in <module>
    from .api import NGBClassifier, NGBRegressor, NGBSurvival
  File "/Users/omarwagih/miniconda2/envs/pred_dpsi/lib/python3.8/site-packages/ngboost/api.py", line 64
    super().__init__(Dist, Score, Base, natural_gradient, n_estimators, learning_rate,
    ^
IndentationError: expected an indented block

Can you please fix?

minibatch_frac < 1 does not work if X is a dataframe

If X is a pandas dataframe and Y is a series, and minibatch_frac == 1, then ngb.fit(X, Y) works with no issues.

But if minibatch_frac < 1, then the function sample in ngboost.py fails to work on dataframes:

    def sample(self, X, Y, params):
        if self.minibatch_frac == 1.0:
            return np.arange(len(Y)), X, Y, params
        sample_size = int(self.minibatch_frac * len(Y))
        idxs = np_rnd.choice(np.arange(len(Y)), sample_size, replace=False)
        return idxs, X[idxs,:], Y[idxs], params[idxs, :]

Because X[idxs, :] is not valid DataFrame syntax.

A workaround that works on my machine:

    def sample(self, X, Y, params):
        if self.minibatch_frac == 1.0:
            return np.arange(len(Y)), X, Y, params
        sample_size = int(self.minibatch_frac * len(Y))
        idxs = np_rnd.choice(np.arange(len(Y)), sample_size, replace=False)
        try:
            X_batch = X[idxs, :]
        except TypeError:
            X_batch = X.iloc[idxs, :]

        return idxs, X_batch, Y[idxs], params[idxs, :]
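
An equivalent, slightly tidier variant (a sketch, also covering Y being a pandas Series):

    def sample(self, X, Y, params):
        if self.minibatch_frac == 1.0:
            return np.arange(len(Y)), X, Y, params
        sample_size = int(self.minibatch_frac * len(Y))
        idxs = np_rnd.choice(np.arange(len(Y)), sample_size, replace=False)
        # positional indexing for pandas objects, plain indexing otherwise
        X_batch = X.iloc[idxs, :] if hasattr(X, "iloc") else X[idxs, :]
        Y_batch = Y.iloc[idxs] if hasattr(Y, "iloc") else Y[idxs]
        return idxs, X_batch, Y_batch, params[idxs, :]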

I'm running version 0.1.3, installed from github via pip today, on Mac OS 10.14.13, python version 3.7.4

Save and load model

Hi, thank you for this repository. I'm using it in my work. I would like to know how to save and load a trained model; joblib.dump doesn't work.

NGBClassifier uses DecisionTreeRegressor by default

default_tree_learner is a DecisionTreeRegressor with the friedman_mse criterion, which is kind of weird to use for classification.
I may be a bit confused here, but is it really fine to use a Regressor and not a Classifier as the base learner? It may be by design, but it looks really weird.

Return train and val loss

Thank you for the excellent work with NGBoost, really excited to have been testing it out!

In commit c4b46b9 the fit method was altered to return self instead of the train and val losses. Is there any way to access the losses with the current behavior?

I believe the losses should be accessible, because we may not be interested in doing early stopping but rather in training for a larger number of iterations and simply choosing the best val loss.

Also, returning the losses is essential to compare different models.

readme: link not working

Hey, great work. FYI, the first link to the ngboost page in your readme is broken. I guess it should be https://stanfordmlgroup.github.io/projects/ngboost/, instead of https://stanfordmlgroup.github.io/project/ngboost/.

Thanks !

why do I sometimes get a "LinAlgError: Singular matrix" error?

model = NGBClassifier(Base=default_tree_learner, Dist=Bernoulli, Score=MLE,
                      natural_gradient=True, verbose=False, n_estimators=500)

Here is the error:


 226 # fitting
--> 227 model.fit(train_arx,train_ary)
228 if return_proba :
229 predict_value = model.predict_proba(test_arx)

~/anaconda3/lib/python3.6/site-packages/ngboost/ngboost.py in fit(self, X, Y, X_val, Y_val, sample_weight, val_sample_weight, train_loss_monitor, val_loss_monitor, early_stopping_rounds)
119 loss_list += [train_loss_monitor(D, Y_batch)]
120 loss = loss_list[-1]
--> 121 grads = S.grad(D, Y_batch, natural=self.natural_gradient)
122
123 proj_grad = self.fit_base(X_batch, grads, sample_weight)

~/anaconda3/lib/python3.6/site-packages/ngboost/scores.py in grad(forecast, Y, natural)
13 grad = forecast.D_nll(Y)
14 if natural:
---> 15 grad = np.linalg.solve(fisher, grad)
16 return grad
17

<__array_function__ internals> in solve(*args, **kwargs)

~/anaconda3/lib/python3.6/site-packages/numpy/linalg/linalg.py in solve(a, b)
401 signature = 'DD->D' if isComplexType(t) else 'dd->d'
402 extobj = get_linalg_error_extobj(_raise_linalgerror_singular)
--> 403 r = gufunc(a, b, signature=signature, extobj=extobj)
404
405 return wrap(r.astype(result_t, copy=False))

~/anaconda3/lib/python3.6/site-packages/numpy/linalg/linalg.py in _raise_linalgerror_singular(err, flag)
95
96 def _raise_linalgerror_singular(err, flag):
---> 97 raise LinAlgError("Singular matrix")
98
99 def _raise_linalgerror_nonposdef(err, flag):

LinAlgError: Singular matrix


On the same data it sometimes works fine and sometimes raises this error. What should I do when it happens?

ngboost.distns.Normal implementation of nll and D_nll functions

Disclaimer: my background is mainly in (clinical) chemistry and less in maths/statistics (although I try to read and learn about it on a day-to-day basis) so please excuse me in advance if I am missing some obvious points.

I am trying to implement a ngboost.distns.Bernoulli class for binary classification problems using NGBoost algorithms. However, I have some questions around the implementation of the ngboost.distns.Normal, specifically the nll and D_nll functions. The nll (negative log-likelihood) function is written as follows:

def nll(self, Y):
  return -self.dist.logpdf(Y).mean()

My first question would be: is this actually the negative log-likelihood function? As far as I know, we take the sum of the log-PDF, i.e. logpdf(<data>).sum(), to obtain the log-likelihood of a Normal distribution. Is there some specific reason we average over N in this particular nll function? Secondly, the D_nll implementation (the derivative of the nll function) is as follows:

def D_nll(self, Y_):
  Y = Y_.squeeze()
  D = np.zeros((self.var.shape[0], 2))
  D[:, 0] = (self.loc - Y) / self.var
  D[:, 1] = 1 - ((self.loc - Y) ** 2) / self.var
  return D

So why do we use this kind of implementation for the derivative? Wouldn't it be better to go for a more generic approach and, for instance, use scipy.optimize? Does this have to do with the fact that we use natural gradients? Third, why do we add 1e-8 to the scale and variance of our Normal distribution when we initialize it?
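
While trying to understand this, I put together a standalone finite-difference check (a sketch, assuming the two internal parameters are loc and log(scale)) that reproduces exactly those expressions:

    import numpy as np
    from scipy.stats import norm

    # per-observation NLL of a Normal parameterized by (loc, log_scale),
    # and its analytic gradient, matching the D_nll expressions above
    def nll(y, loc, log_scale):
        return -norm.logpdf(y, loc=loc, scale=np.exp(log_scale))

    def d_nll(y, loc, log_scale):
        var = np.exp(2 * log_scale)
        return np.array([(loc - y) / var, 1 - (loc - y) ** 2 / var])

    y, loc, log_scale = 1.3, 0.5, -0.2
    eps = 1e-6
    numeric = np.array([
        (nll(y, loc + eps, log_scale) - nll(y, loc - eps, log_scale)) / (2 * eps),
        (nll(y, loc, log_scale + eps) - nll(y, loc, log_scale - eps)) / (2 * eps),
    ])
    print(np.allclose(numeric, d_nll(y, loc, log_scale)))  # True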

Thank you very much in advance.

Releases to PyPI and conda-forge

Hello! I am very excited to see this package 🎉. Was wondering what the release plan to PyPI and conda-forge is? If you make a PyPI package, I am happy to help build the conda-forge recipe for you.

Distributions for classification

This algorithm works really well for regression problems, where we can choose among the available distributions. However, classification is limited to Bernoulli, which outputs a single probability value for a given outcome. Is there any way we can get a confidence interval estimate for each possible outcome?

Sorry, I couldn't install ngboost under Windows

pip install ngboost
Collecting ngboost
Using cached https://files.pythonhosted.org/packages/58/15/8942e2b8a38f92b92e1dedad882b0746372e4acde236e74b98bfa66d717a/ngboost-0.1.3-py3-none-any.whl
Collecting scikit-learn>=0.21.3
Using cached https://files.pythonhosted.org/packages/76/79/60050330fe57fb59f2c53d0d11673df28c20ea9315da3652477429fc4949/scikit_learn-0.21.3-cp36-cp36m-win_amd64.whl
Collecting numpy>=1.17.2
Using cached https://files.pythonhosted.org/packages/55/7a/f32b39164262765b069b0fe3ec5d4b47580c9c60f7bd3588b58ba8e93a4c/numpy-1.17.3-cp36-cp36m-win_amd64.whl
ERROR: Could not find a version that satisfies the requirement jaxlib>=0.1.29 (from ngboost) (from versions: none)
ERROR: No matching distribution found for jaxlib>=0.1.29 (from ngboost)

return predictions at best iteration

If fit was called with a validation set, subsequent calls to predict should return the predictions from the model at best iteration according to the score on the validation set.

Feature importance

Hi,
First of all, many thanks for ngboost.

The issue I'm reporting is related to the feature importances property. When I try to get this property, unfortunately the returned value is None. I believe this is due to the following if clause:
if not 'sklearn.tree.tree.DecisionTreeRegressor' in str(type(self.base_models[0][0])): return None
In sklearn version 0.22, str(type(self.base_models[0][0])) returns <class 'sklearn.tree._classes.DecisionTreeRegressor'>
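
A quick standalone way to see why the string check no longer matches on sklearn 0.22:

    import sklearn.tree

    tree = sklearn.tree.DecisionTreeRegressor()
    print(str(type(tree)))
    # -> <class 'sklearn.tree._classes.DecisionTreeRegressor'> on sklearn 0.22
    print('sklearn.tree.tree.DecisionTreeRegressor' in str(type(tree)))  # False
    print(isinstance(tree, sklearn.tree.DecisionTreeRegressor))          # True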

For me, it works if I replace the clause by
if not isinstance(self.base_models[0][0], sklearn.tree.DecisionTreeRegressor):

Can you please check if you also have this problem?

Many thanks,
Carlos

Jaxlib can't be installed under Windows

Collecting jaxlib
Could not find a version that satisfies the requirement jaxlib (from versions: )
No matching distribution found for jaxlib

google/jax#507

The issue raised on the jax GitHub page causes installation problems under Windows for this package.

Would you consider releasing the PyTorch version too? I could make that work from the backup branch and the results look similar (though I'm not sure whether the PyTorch implementation contains all the tricks from the jax version).

Any plan for model explanation functionality?

I read the paper and tried this package; I have to say it is marvelous! I will not hesitate to use it in my real-life work, but I wonder if there is a plan to add support for model explanation tools such as feature importance plots, SHAP plots, or a tree visualizer. It would be crucial for presenting the model explanation to business stakeholders.
Thanks for this fantastic masterpiece!

empirical examples used retrain, not refit as in the paper

It seems the empirical results retrain the model (changing tree structures) rather than refitting it (keeping the pretrained tree structures) in "ngboost/examples/empirical/regression.py". However, the paper explicitly says "refit", which is more reasonable.

input validation

  • make it clear that the only acceptable inputs are numeric numpy arrays
  • labels should be integers from 0 to K-1 in the case of classification (see the sketch below)
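
One possible shape for both checks, sketched with scikit-learn's own validator (validate_inputs is just an illustration, not a committed design):

    import numpy as np
    from sklearn.utils import check_X_y

    def validate_inputs(X, Y, classification=False):
        # coerce X and Y to numeric numpy arrays, raising a clear error otherwise
        X, Y = check_X_y(X, Y, dtype="numeric", y_numeric=True)
        if classification:
            classes = np.unique(Y)
            if not np.array_equal(classes, np.arange(len(classes))):
                raise ValueError("Y must contain integer class labels 0, 1, ..., K-1")
        return X, Y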

First try at implementing `ngboost.distn.Bernoulli` class

Disclaimer: forgive me for all my stupid mistakes and/or misinterpretation of several statistical things. My intentions were to provide a complete, working example of the Bernoulli class before uploading it, but because several people told me that they would like to help I decided to put this (preliminary) version already on GitHub. Once it is in a more sophisticated state I guess we can open up a pull request.

So for classification problems we need distributions that match these problems accordingly (e.g. Bernoulli for binary classification). Thus, my aim was to create a Bernoulli class which would make binary classification with NGBoost feasible. Over the last few days I studied a lot of probability and statistics and did my best to read the NGBoost paper in detail. Please forgive me in advance if I completely misunderstood the whole concept and my implementation is complete nonsense (if so, please tell me). Below I provide a first version of the Bernoulli class, in which I would like to point out several things:

  1. The Bernoulli class was tested on the breast cancer classification dataset (scikit-learn) and it somehow seems to converge. Also the predicted probabilities seem to match with the labelled outcomes of the test dataset. There are some runtime issues, however.

  2. I am not sure how to implement the Bernoulli.fit method, as I don't have a better idea than to just set the initial parameter to the average positive rate in the dataset. Additionally, I am not 100% sure about the nll, D_nll and fisher_info functions.

  3. Running the example gives a lot of RuntimeWarnings, exclusively about mathematical operations (e.g. invalid values, division by zero). This is caused by the NGBoost.line_search function, but I have yet to look into what exactly is causing it.

Bernoulli class:

import numpy as np
from scipy.stats import bernoulli as dist  # assumed import, mirroring ngboost.distns.Normal


class Bernoulli(object):
    """
    Bernoulli class containing the Bernoulli distribution.

    ...

    Attributes
    ----------
    n_params : int
        contains the number of parameters in our distribution.

    Methods
    -------
    nll(Y)
        returns the negative log-likelihood dependent on data `Y`.
    D_nll(Y)
        returns the first derivative of the negative log-likelihood dependent on data `Y`.
    fisher_info()
        returns the Fisher information.
    """

    n_params = 1

    def __init__(self, params):
        # probability of success (the only parameter)
        self.p = params[0]

        # initialize the underlying distribution
        self.dist = dist(self.p)

    def __getattr__(self, name):
        if name in dir(self.dist):
            return getattr(self.dist, name)
        return None

    def nll(self, Y):
        # formula: -(log(p) * Y + log(1 - p) * (1 - Y))
        Y = Y.squeeze()

        return np.array(-(np.log(self.p) * Y + np.log(1. - self.p) * (1 - Y)))

    def D_nll(self, Y):
        # formula: (Y / p) - ((1 - Y) / (1 - p))
        Y = Y.squeeze()
        D = (Y / self.p) - ((1 - Y) / (1 - self.p))
        return D.reshape(-1, 1)

    def crps(self, Y):
        raise NotImplementedError('crps not implemented yet')

    def crps_metric(self, Y):
        raise NotImplementedError('crps_metric not implemented yet')

    def fisher_info(self):
        # formula: 1 / (p * (p - 1))
        FI = np.ones((self.p.shape[0], 1, 1))
        FI[:, 0, 0] = 1 / (self.p * (self.p - 1))
        return FI

    def fisher_info_cens(self, Y):
        # not sure, is this a specific function for censored data?
        # these "_cens" functions don't seem to be called in the API anywhere, I guess
        # they can be removed and mainly exist in other classes because of deprecated code?
        raise NotImplementedError('fisher_info_cens not implemented yet')

    def fit(Y):
        # how to fit to initial generic data?
        # for now I set `p` to the fraction of the positive class, not sure if this is correct..
        return np.array([sum(Y.squeeze()) / len(Y.squeeze())])

To perform a small test (WARNING: loads of RuntimeWarnings):

from ngboost.ngboost import NGBoost
from ngboost.learners import default_tree_learner
from ngboost.scores import MLE

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, Y = load_breast_cancer(True)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)

ngb = NGBoost(Base=default_tree_learner, Dist=Bernoulli, Score=MLE(), natural_gradient=True,
              verbose=True)
ngb.fit(X_train, Y_train)
Y_dists = ngb.pred_dist(X_test)

test_NLL = Y_dists.nll(Y_test.flatten()).mean()
print('Test NLL', test_NLL)
print(Y_dists.p)
print(Y_test)

Full example also available as a Google Colab notebook at: https://colab.research.google.com/drive/1_O2w1MXjuMKq7bc8Pj4Atv5a_bGKWSp7.

I am open to any suggestions, tips, help, or guidance on how to develop this further. And once more, my apologies in advance if I am completely missing the point somewhere.

Overflow warnings

This package looks so promising!

I am just testing it out on my dataset with dimensions (N, M) = (57795, 144). At first I tried with N=100, N=1000, and N=10_000 and it worked well. Now I am trying to run it on all N=57_795 and I am encountering some overflow warnings, see below. Is this something to be worried about?

[iter 0] loss=2.6377 val_loss=0.0000 scale=0.1250 norm=0.3378
~/miniconda3/envs/py37/lib/python3.7/site-packages/ngboost/distns/normal.py:13: RuntimeWarning:

overflow encountered in exp

~/miniconda3/envs/py37/lib/python3.7/site-packages/ngboost/distns/normal.py:14: RuntimeWarning:

overflow encountered in square

Cheers,
Christian

Fixing error

Encountering this error repeatedly - "type object 'LogNormal' has no attribute 'scores' ". How can I fix it?

natural_gradient option in fit doesn't seem to change the result

Hi again,

Setting

NGBoost(Base=default_tree_learner, Dist=Normal, Score=MLE(), natural_gradient=True)

vs

NGBoost(Base=default_tree_learner, Dist=Normal, Score=MLE(), natural_gradient=False)

seems to give the same results. The code below shows at least this is the case for the predictions.

I checked the source code and the line https://github.com/stanfordmlgroup/ngboost/blob/master/ngboost/ngboost.py#L21 sets the attribute natural_gradient but then it doesn't seem to use it anywhere else in that file.

from ngboost.ngboost import NGBoost
from ngboost.learners import default_tree_learner
from ngboost.scores import MLE
from ngboost.distns import Normal

from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import numpy as np

np.random.seed(seed=2334)

X, Y = load_boston(True)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)

ngb_natural = NGBoost(Base=default_tree_learner, Dist=Normal, Score=MLE(), natural_gradient=True,
                      verbose=False)

np.random.seed(seed=2334)
ngb_natural.fit(X_train, Y_train)
Y_preds_nat = ngb_natural.predict(X_test)

ngb_artificial = NGBoost(Base=default_tree_learner, Dist=Normal, Score=MLE(), natural_gradient=False,
                      verbose=False)


np.random.seed(seed=2334)
ngb_artificial.fit(X_train, Y_train)
Y_preds_art = ngb_artificial.predict(X_test)

#This one comes out True
assert np.allclose(Y_preds_nat, Y_preds_art)

GPU usage

Hi! I couldn't find anything on the site nor the paper on how to utilize this with GPUs. Is there even any GPU support as of yet?

early_stopping_rounds

Would it make sense to implement early_stopping_rounds like in LGBM or XGB? If so, I'm happy to contribute to the issue.

Add __version__ attribute to the package

Hi,

I couldn't find any version attribute within the package.
I think it's helpful for bug reports to be able to do

import ngboost
print(ngboost.__version__)
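
Something as simple as this would do (a sketch; the version string is only an example and would need to be kept in sync with the packaging metadata):

    # ngboost/__init__.py
    from .api import NGBClassifier, NGBRegressor, NGBSurvival

    __version__ = "0.1.3"  # example value; keep in sync with setup.py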

Non reproducibility of results even after np.random.seed is set

Hi,

I expected the training and predictions to be the same after setting np.random.seed
Is there another seed I should set to obtain reproducible results?

Below I have an example you can run. I'm using the current version from Github.

from ngboost.ngboost import NGBoost
from ngboost.learners import default_tree_learner
from ngboost.scores import MLE
from ngboost.distns import Normal

from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import numpy as np

np.random.seed(seed=2334)

X, Y = load_boston(True)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)

ngb_natural = NGBoost(Base=default_tree_learner, Dist=Normal, Score=MLE(), natural_gradient=True,
                      verbose=False)

ngb_natural.fit(X_train, Y_train)
Y_preds_nat = ngb_natural.predict(X_test)

ngb_natural2 = NGBoost(Base=default_tree_learner, Dist=Normal, Score=MLE(), natural_gradient=True,
                      verbose=False)

ngb_natural2.fit(X_train, Y_train)
Y_preds_nat2 = ngb_natural2.predict(X_test)

#This one comes out False
assert np.allclose(Y_preds_nat, Y_preds_nat2)

#This one too
assert np.allclose(Y_preds_nat, Y_preds_nat2, rtol=1e-3)

#This one is true
assert np.allclose(Y_preds_nat, Y_preds_nat2, rtol=1e-2)

Add support for GridSearchCV

Thanks for sharing your work. Maybe it is necessary to add support for scikit-learn's GridSearchCV to improve usability and adoption.

How to use the train and val monitors?

Hi, I was wondering what the purpose of the train_loss_monitor and val_loss_monitor arguments is. Sklearn models usually include a monitor argument that allows for early stopping. Is that the idea?
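
Based on the fit() signature shown in a traceback elsewhere on this page, my guess is that a monitor is called as monitor(D, Y_batch) and its return value is recorded as that iteration's loss, so one sketch of using it to collect per-iteration training losses would be (the NLL choice is an assumption):

    from sklearn.datasets import load_boston
    from sklearn.model_selection import train_test_split
    from ngboost import NGBRegressor

    X, Y = load_boston(True)
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)

    train_losses = []

    def my_train_monitor(D, Y_batch):
        # assumption: the monitor's return value is used as that iteration's loss;
        # here the NLL (log score) of the current predicted distribution
        loss = D.nll(Y_batch)
        train_losses.append(loss)
        return loss

    ngb = NGBRegressor()
    ngb.fit(X_train, Y_train, train_loss_monitor=my_train_monitor)
    print(train_losses[:5])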

gradient of normal distribution negative log-likelihood

I might be wrong but the gradient seems incorrect wrt sigma

I think the following should be used instead; my tests give bad results on Boston (NLL) with the current implementation, and with this change it's good again:

    def D_nll(self, Y_):
        Y = Y_.squeeze()
        D = np.zeros((self.var.shape[0], 2))
        D[:, 0] = (Y - self.loc) / self.var
        D[:, 1] = (Y - self.loc)**2 / self.scale**3 - 1/self.scale
        return -D

I can't install it

Collecting git+https://github.com/stanfordmlgroup/ngboost.git
Cloning https://github.com/stanfordmlgroup/ngboost.git to c:\users\stig.cz\appdata\local\temp\pip-req-build-zjnf7vpn
Running command git clone -q https://github.com/stanfordmlgroup/ngboost.git 'C:\Users\Stig.CZ\AppData\Local\Temp\pip-req-build-zjnf7vpn'
error: RPC failed; curl 18 transfer closed with outstanding read data remaining
fatal: the remote end hung up unexpectedly
fatal: early EOF
fatal: index-pack failed
ERROR: Command errored out with exit status 128: git clone -q https://github.com/stanfordmlgroup/ngboost.git 'C:\Users\Stig.CZ\AppData\Local\Temp\pip-req-build-zjnf7vpn' Check the logs for full command output.

What are the limitations on adding a base learner?

I would like to know what the limitations are on adding base learners. I see only two learners implemented in ngboost/ngboost/learners.py, each taken from sklearn. Is it the case that we can add any base learner from sklearn simply by adding it to this file and then specifying it at NGBoost instantiation time? If not, what is the limitation? Why aren't more base learners implemented?
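
My guess (to be confirmed) is that Base just needs to be a scikit-learn-style regressor supporting fit/predict with sample_weight, so plugging in a different learner might look like this sketch:

    from sklearn.linear_model import Ridge
    from ngboost import NGBRegressor

    # sketch: swap in a different sklearn regressor as the base learner that is
    # fit to the (natural) gradients at each boosting iteration
    ngb = NGBRegressor(Base=Ridge(alpha=1.0), n_estimators=200)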

classification broken for large n_estimators and small minibatch_frac

X, Y = load_breast_cancer(True)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)

ngb = NGBClassifier(Dist=Bernoulli, verbose=True, minibatch_frac=0.5, n_estimators=50)
ngb.fit(X_train, Y_train)

usually delivers a long error stack terminating in

TypeError: ufunc 'expit' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

but

ngb = NGBClassifier(Dist=Bernoulli, verbose=True, minibatch_frac=1, n_estimators=50)
ngb.fit(X_train, Y_train)

does not, nor does

ngb = NGBClassifier(Dist=Bernoulli, verbose=True, minibatch_frac=0.5, n_estimators=20)
ngb.fit(X_train, Y_train)
