statmixedml / xgboostlss
An extension of XGBoost to probabilistic modelling
Home Page: https://statmixedml.github.io/XGBoostLSS/
License: Apache License 2.0
It would be nice to have access to the new multi-output feature exposed in xgboost 2.0
https://xgboost.readthedocs.io/en/stable/tutorials/multioutput.html#training-with-vector-leaf
Hi All,
I'm encountering dependency conflicts when trying to install XGBoostLSS using Poetry. The package currently specifies strict maximum versions for several dependencies, including torch, optuna, and scikit-learn. This causes conflicts with other common packages in my project that require more recent versions of these libraries.
Specifically:
Would it be possible to update or relax these version requirements? This could involve:
Here are some suggested edits for the setup.py file:
install_requires=[
    "xgboost>=2.0.0,<3.0.0",
    "torch>=2.1.0,<3.0.0",
    "pyro-ppl>=1.8.0,<2.0.0",
    "optuna>=3.0.0,<4.0.0",
    "scikit-learn>=1.0.0,<2.0.0",
    "numpy>=1.20.0,<2.0.0",
    "pandas>=1.0.0,<3.0.0",
    # ... (similar changes for other dependencies)
]
Would it be possible to support XGBoost 1.6 (or later)? If so, what would be the process for getting that out? (I could possibly take a look at upgrading it / raising a PR.)
Context is that we have some other dependencies that require XGBoost 1.6 or greater which clashes with the setup.py here.
Would it be possible to implement a 0 (and maybe even 0 and 1) adjusted dirichlet distribution, similar to:
Tsagris, M., & Stewart, C. (2018). A Dirichlet Regression Model for Compositional Data with Zeros. Lobachevskii Journal of Mathematics, 39(3), 398–412. doi:10.1134/s1995080218030198
I'm looking into the code for the xgboostlss class, and it seems the validation metric is hardcoded to negative log-likelihood. Will there be flexibility to choose the validation metric (e.g. MAE)?
As of now my tuning process is returning inf for each trial.
May I ask what the procedure is to install the package? I used install_github but it did not work.
Hello, I have read the paper and I have to say that it is some really great work. Is it possible to upload the code, as others have asked for it? You have 87 stars and 14 forks without uploading the code. You could help a lot of people.
First of all thank you for working on several boosting LSS versions (xgboost, catboost, lightgbm)!
I did notice that both xgb and lightgbm have (at first sight) the exact same distributions.py submodule, which is the torch core for any distribution supported as a boosting prediction.
Is there a specific reason not to share the distribution modules (and any other shared functionality, like plotting of distribution results) in a common Python module that then gets imported as a common dependency, to avoid duplication and to separate the development of adding distributions from adding functionality to the boosting part of the modules?
Does XGBoostLSS support categorical features and associated hyperparameters?
I'm getting an error when trying to conduct SHAP interpretations for a model containing categorical features:
[18:06:07] WARNING: /workspace/src/c_api/c_api.cc:1240: Saving into deprecated binary model format, please consider using `json` or `ubj`. Model format will default to JSON in XGBoost 2.2 if not specified.
<IPython.core.display.HTML object>
Error in py_call_impl(callable, call_args$unnamed, call_args$named) :
xgboost.core.XGBoostError: [18:06:07] /workspace/src/tree/tree_model.cc:899: Check failed: !HasCategoricalSplit(): Please use JSON/UBJSON for saving models with categorical splits.
Thanks for the great work!
I have two questions:
Can we perform multi-task learning in a single training where one task is classification and the target variables are categorical (classes) and the other task is regression where the target values are continuous?
Does XGBoostLSS models support ONNX conversions?
Thanks for this great library!
I was wondering if you plan to include censored loss functions as well, to allow for censored probabilistic regression. Similar to what XGB AFT models do, but with a non-constant scale parameter in the distribution models (https://xgboost.readthedocs.io/en/stable/tutorials/aft_survival_analysis.html).
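As a sketch of the idea, here is a right-censored Normal negative log-likelihood in numpy/scipy (the function name and the Normal choice are mine for illustration; an actual XGBoostLSS objective would need torch so that both location and scale receive gradients):

```python
import numpy as np
from scipy.stats import norm

def censored_normal_nll(y, censored, mu, sigma):
    """Negative log-likelihood for right-censored Normal data: observed
    points contribute the log-density, censored points the log-survival
    probability P(Y > y). Illustrative sketch, not the package's API."""
    nll_obs = -norm.logpdf(y, loc=mu, scale=sigma)
    nll_cens = -norm.logsf(y, loc=mu, scale=sigma)
    return np.where(censored, nll_cens, nll_obs).sum()
```

The AFT loss in xgboost fixes the scale globally; letting sigma vary per observation, as above, is exactly the LSS-style generalisation being requested.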
@uday1889: May I ask you to check out the readme and provide feedback?
Hi! First, I just wanted to say thank you so much for XGBoostLSS and LightGBMLSS, they're amazing packages and super useful :)
I wanted to ask, have you considered adding support for an EvoTrees backend? I think this would be incredibly helpful; EvoTrees.jl is the main package I use for regression trees. Thank you!
Hello! I've been exploring your package for a predictive modelling project, and I think I've found an issue? Either that or I've missed something important I need to do, but either way it's not working. Basically, whenever I try to train a model using a beta distribution as the output, I get log-likelihoods of 0 every time. I've put reproducible code below, using the sklearn diabetes regression toy dataset as an example (alright, it's not a beta distribution, but it should still return something non-zero...). The problem doesn't occur with the Gaussian distribution, and it occurs regardless of stabilisation method, and whether I'm using train, cv or hyper_opt, so it's not just a hyperparameter choice. During training you also at some point get a message saying either "Mean of empty slice" or "All-NaN slice encountered", so that probably points to the cause. I'm not entirely sure what's causing this, but it's probably an issue with the custom objective function? Could be the custom metrics, I suppose, but I think the objective function is more likely.
Anyway, if you could take a look at it and let me know if I'm doing something wrong, or if there is something weird going on under the hood, that would be great! Thanks very much.
from sklearn.datasets import load_diabetes
from xgboostlss.model import xgboostlss
from xgboostlss.distributions import Beta, Gaussian
from xgboost import DMatrix
X, y = load_diabetes(return_X_y=True, as_frame=True)
train_x = X[:-50]
train_y = y[:-50]
test_x = X[-50:]
test_y = y[-50:]
dtrain = DMatrix(train_x, train_y)
dtest = DMatrix(test_x, test_y)
beta = Beta
beta.stabilize = "L2"
single_trial_params = {
    "eta": 0.01,
    "max_depth": 4,
    "gamma": 1,
    "subsample": 0.6,
    "colsample_bytree": 0.7,
    "min_child_weight": 1,
}
eval_results = {}
model = xgboostlss.train(
    params=single_trial_params,
    dtrain=dtrain,
    dist=beta,
    evals=[(dtrain, 'train'), (dtest, 'test')],
    evals_result=eval_results,
)
Dear community,
I am currently working on a probabilistic extension of XGBoost that models all parameters of a distribution. This allows one to create probabilistic forecasts from which prediction intervals and quantiles of interest can be derived.
The problem is that XGBoost doesn't permit optimizing over several parameters. Assume we have a Normal distribution y ~ N(µ, sigma). So far, my approach is a two-step procedure, where I first optimize µ with sigma fixed, then optimize sigma with µ fixed, and then iterate between these two.
Since this is inefficient, is there any way of simultaneously optimizing both µ and sigma using a custom loss function?
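For anyone landing here: a common trick is to parameterize the scale as exp(log sigma) and hand the booster the stacked gradients of the joint NLL, one output (or set of trees) per parameter, so both are updated in the same boosting iteration. A minimal numpy sketch of the Normal-distribution gradients (illustrative, not the package's actual objective):

```python
import numpy as np

def normal_nll_grads(y, mu, log_sigma):
    """Element-wise gradients of the Normal negative log-likelihood
    with respect to mu and log(sigma). Optimizing log(sigma) instead
    of sigma keeps the scale positive without constraints."""
    sigma2 = np.exp(2.0 * log_sigma)
    grad_mu = -(y - mu) / sigma2                    # d NLL / d mu
    grad_log_sigma = 1.0 - (y - mu) ** 2 / sigma2   # d NLL / d log(sigma)
    return grad_mu, grad_log_sigma

# Sanity check: both gradients vanish (on average) at the sample MLE.
y = np.array([1.0, 2.0, 3.0, 4.0])
mu_hat = y.mean()
log_sigma_hat = 0.5 * np.log(((y - mu_hat) ** 2).mean())
g_mu, g_ls = normal_nll_grads(y, mu_hat, log_sigma_hat)
```

In a custom xgboost objective, the two gradient vectors (and the corresponding Hessians) would be interleaved so that each boosting round moves both parameters at once.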
Thanks for the great package!
Currently, when fitting data to a distribution only the NLLH loss is returned.
I would like to have some statistical test result as well, like the Kolmogorov–Smirnov test.
Any hint how I can get that?
Thanks!
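One hedged way to get this outside the package: take the fitted parameters and run scipy's KS test against the implied distribution. Here a single global Normal fit on synthetic data stands in for the model output, purely for illustration:

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for the training target; with XGBoostLSS one would
# instead plug in the predicted distributional parameters.
rng = np.random.default_rng(1)
y = rng.normal(loc=3.0, scale=2.0, size=500)
mu_hat, sigma_hat = y.mean(), y.std()

# One-sample KS test of the observations against the fitted Normal.
ks = stats.kstest(y, stats.norm(loc=mu_hat, scale=sigma_hat).cdf)
```

Note that testing against parameters estimated from the same data makes the standard KS p-value anti-conservative; for per-sample predicted distributions, a PIT histogram is a common alternative diagnostic.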
It would be great if XGBoostLSS can support Lambert W x F distributions; particularly useful are Lambert W x Gaussian distributions (Tukey's h is a special case of this for
In XGBoostLSS context I can see this being useful any time where normal regression might be too restrictive to give correct tail probability estimates (e.g., low sample size; financial data) and one can inspect
I'm not aware of a pytorch implementation of the Lambert W function, let alone Lambert W x F distributions. TensorFlow has both implemented; scipy.special.lambertw implements the Lambert W function.
If a pytorch implementation of the distribution is required to make this work in XGBoostLSS, then as an alternative, AFAICT this should be possible to accomplish using normalizing flows, with the heavy-tail Lambert W transformation as a specific normalizing flow function.
References
heavy-tail Lambert W x F distributions (Goerg, 2015)
LambertW R package: https://github.com/gmgeorg/LambertW
TensorFlow probability implementation of Lambert W bijectors and Lambert W x Gaussian Distribution
gaussianization layers based on Lambert W x F transformations/distributions: https://openreview.net/forum?id=OXP9Ns0gnIq
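The heavy-tail transformation itself does round-trip through scipy.special.lambertw; a sketch of the forward/inverse pair from the heavy-tail Lambert W x F construction (Goerg, 2015), with illustrative parameter values:

```python
import numpy as np
from scipy.special import lambertw

def heavy_tail_transform(u, delta):
    """Forward heavy-tail Lambert W transform: z = u * exp(delta/2 * u**2).
    delta > 0 fattens the tails of the latent variable u."""
    return u * np.exp(0.5 * delta * u ** 2)

def inverse_transform(z, delta):
    """Back-transform via the principal branch of the Lambert W function:
    u = sign(z) * sqrt(W(delta * z**2) / delta)."""
    w = lambertw(delta * z ** 2).real  # principal branch is real for delta*z**2 >= 0
    return np.sign(z) * np.sqrt(w / delta)

u = np.array([-2.0, -0.5, 0.5, 2.0])
z = heavy_tail_transform(u, delta=0.3)
u_back = inverse_transform(z, delta=0.3)  # recovers u
```

A pytorch version would only need an autograd-friendly Lambert W (e.g. a few Halley iterations), which is the missing piece noted above.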
Hi, this is a great library and I would like to tweak the source code for my own work. Do you plan to share your source code?
It would be great to integrate this package, and adjacent ones like LightGBMLSS, with skpro, which in turn directly integrates with sktime for time series forecasting (both of course integrate seamlessly with sklearn).
Issue opened here: sktime/skpro#184
This is very similar to the suggestion of @joshdunnlime for an sklearn interface; skpro already provides interface specifications and stringent tests (no need to write new ones!) for probabilistic tabular regressors.
What would be needed is, as far as I see it:
- a predict_proba interface
- XGBoostLSS implemented as skpro tabular distributions

Architecturally, there are two options:
1. distributions and estimators stay in XGBoostLSS, and the interfacing work is done in skpro
2. use check_estimator from skpro (works on distribution objects as well as on estimators) to create fully skpro conformant interfaces within XGBoostLSS, then have a light import wrapper in skpro

skpro already has an adapter to tensorflow for distributions.
Personally, I would think option 1 is preferable at least for the distributions, since the different distribution types are of general use, including for statmixedML's other packages, so it would avoid duplication of distribution objects or interfaces.
At a quick glance, it seems that the current setup.py file is fully exhaustive on all dependencies as absolute requirements, including very specific version ranges. If at all possible, it would greatly improve compatibility with existing Python repos (and let more people use it without having to resolve conflicts) if install_requires only specified the absolutely required modules (e.g., plotting or optuna is not really required to use this great package) with the minimum version needed (>=), instead of the (approximate) ~= range.
See also the first accepted answer here:
https://stackoverflow.com/questions/6947988/when-to-use-pip-requirements-file-versus-install-requires-in-setup-py
Curious to learn more about whether this package has to be so specific/restrictive on the dependencies (e.g., a suggestion is to use optional dependencies:
https://stackoverflow.com/questions/10572603/specifying-optional-dependencies-in-pypi-python-setup-py)
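A hedged sketch of what a slimmed-down setup.py could look like: floors only for hard runtime requirements, optional features behind extras. The grouping and the minimum versions shown are illustrative, not a tested constraint set:

```python
# setup.py fragment: hard runtime requirements, minimum versions only.
install_requires = [
    "xgboost>=1.6",
    "torch>=1.13",
    "numpy>=1.20",
    "pandas>=1.0",
]

# Optional features move to extras, installed on demand,
# e.g. `pip install xgboostlss[tuning]`.
extras_require = {
    "tuning": ["optuna>=3.0"],
    "plotting": ["matplotlib>=3.5", "seaborn>=0.12"],
}
```

These two dicts would be passed to setuptools' setup() call in place of the current fully pinned list.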
I've opened this issue to pick up our earlier conversation on expectile crossing after the 0.2.0 update.
I reinstalled XGBoostLSS and re-ran the Expectile Modelling ipynb that I shared earlier at this link : https://github.com/maxfield-green/XGBoostLSS/blob/master/examples/simulation_example_Expectile_v2.ipynb
I'm still observing the recentering and expectile crossing behavior that I initially cited.
Is it possible to do point prediction using XGBoostLSS? If it can only predict a prediction interval, then after uncertainty quantification, how can we know whether the predictions are improving? Most importantly, along with uncertainty quantification we need point predictions of the parameters in our research. Any help regarding this problem would be very helpful.
For distribution, it would be very helpful to publish to PyPi. A likely pre-requisite is tagging versions (see #18).
Let me know if you would like any assistance with this.
Hi there,
with the recent update on the model.py I start to get an error:
'DistributionClass' object has no attribute 'loss_fn'
--> 372 pruning_callback = optuna.integration.XGBoostPruningCallback(trial, f"test-{self.dist.loss_fn}")
374 xgblss_param_tuning = self.cv(params=hyper_params,
375 dtrain=dtrain,
376 num_boost_round=num_boost_round,
(...)
381 verbose_eval=False
382 )
384 opt_rounds = xgblss_param_tuning[f"test-{self.dist.loss_fn}-mean"].idxmin() + 1
Any idea how to fix it?
Hi,
your package & paper look really promising. I can't wait to test drive it.
The readme mentions a Julia implementation is planned, that would be amazing.
May I suggest, consider wrapping it in the interface MLJ.jl which already has several boosting & probabilistic forecasting options with more on the way...
The original shap package is currently not maintained. Hence, in its current implementation, shap is not compatible with numpy>=1.24.0. For details see the following issue shap/shap#2911.
Hence, XGBoostLSS currently relies on https://github.com/dsgibbons/shap.git. For this package to be properly installed, please avoid installing xgboostlss in a directory/path or conda/venv environment that contains "xgboost/xgboostlss" or any other xgboost-related name. Otherwise, dsgibbons/shap won't turn off CUDA building in its setup() call and xgboostlss will likely not install properly. See the following issue dsgibbons/shap#50.
Dear colleagues,
According to the docs:
The Gamma distribution is supported, but not the Tweedie family, of which Gamma is a boundary case.
Alternative packages already implement this as for instance:
https://github.com/CDonnerer/xgboost-distribution/
BR
E
Hello, may I ask a question regarding the Plot of Actual vs. Predicted Quantiles part in this notebook: https://github.com/StatMixedML/XGBoostLSS/blob/master/examples/simulation_example_Gaussian.ipynb
How could we choose values for loc, scale, etc.? Thanks.
norm.ppf(quant_sel[0], loc = 10, scale = 1 + 4*((0.3 < test["x"].values) & (test["x"].values < 0.5)) + 2*(test["x"].values > 0.7))
Hi,
Thank you for creating XGBoostLSS.
I have a question, When I set the distribution to "NBI" (Negative Binomial) and make predictions for some samples, I see that the scale parameter is always the same for all samples and it is greater than zero (for example, 88). I suppose that for negative binomial, scale is equal to 1/n and because n>0 it should be less than 1 for most of the time.
Do I have any misunderstanding on this?
I'm following the examples. Everything else works correctly but for some strange reason, when I get to the cell where it makes predictions, I always get the output for pred_type="parameters". I would greatly appreciate any help on this matter.
Thank you!
When I try to use the BCT distribution on a dataset, the resulting prediction parameters (location, nu, tau) are all constant. I believe nu and tau are stuck at their initial values (0.5, 10). Only scale changes, as in the Gaussian example. I am wondering what this means?
The parameters I used for training are:
Best trial:
  Value: 125.35836839999999
  Params:
    eta: 0.040946512023655214
    max_depth: 4
    gamma: 2.9779429635527073e-05
    subsample: 0.2887880027234064
    colsample_bytree: 0.3208686624698766
    min_child_weight: 316
    opt_rounds: 500
Hi, I'm trying to train the XGBLSS on the m5 dataset but I keep on getting out of MemoryError with the hyper_opt method. Is there a parameter to reduce the amount of memory used for this method?
With active development, it may be helpful to bump the version after significant changes, and tag them. This will also help in publishing to pypi.
Let me know if you would like any assistance with this.
Dear Alexander,
We are now comparing MAPIE confidence intervals to a custom methodology we have created based on the XGBoostLSS.
Basically, we are using the expectiles that XGBoostLSS provides to fit a CDF that approximates a Tweedie CDF (this is a bit hacky, but we are getting theoretical coverage).
The problem is that we split this procedure into two parts and use joblib to pass the serialized XGBoostLSS model to the CDF approximation process.
I have not seen any example of how to serialize XGBoostLSS objects, but when trying with joblib:
X_train_ci = processor_reg.transform(
    regression_df.drop(columns=[arguments.clf_targets] + [arguments.reg_targets])
)
y_train_ci = regression_df[arguments.reg_targets]
n_cpu = multiprocessing.cpu_count()
dtrain = xgb.DMatrix(X_train_ci, label=y_train_ci, nthread=n_cpu)
logging.info("Training XGBoost")
xgboostlss_model_expectile = xgboostlss.train(
    hyperparameters,
    dtrain,
    dist=distribution_expectile,
    num_boost_round=hyperparameters["opt_rounds"],
    verbose_eval=True,
)
if partial:
    logging.info("Serializing partial fit")
    joblib.dump(xgboostlss_model_expectile, 'outputs/models/zip_conf_partial_fit.joblib')
else:
    logging.info("Serializing full fit")
    joblib.dump(xgboostlss_model_expectile, 'outputs/models/zip_conf.joblib')
We are getting the following error:
With distribution_expectile defined as:
distribution_expectile = Expectile
distribution_expectile.expectiles = arguments.exp_list
distribution_expectile.stabilize = "MAD"
Are we choosing the wrong serialization method? Do we need to save XGBoostLSS in a different way?
BR
Edgar
When I use joblib to load a model built by xgblss, I find that the reloaded model loses some information: it outputs different values from the original model even for the same input. The difference between the outputs is a constant across all inputs. Can you fix this? I think the model's attributes get lost when the model is reloaded.
By the way, is it possible to add quantile regression to the distribution and multitasking learning? I think they are quite useful.
Is there an appetite to add a scikit learn API for this? If so, very happy to help contrib. Also for LightGBMLSS.
Hi,
Thanks for creating the XGBoostLSS (and LightGBMLSS) package, this an important extension to ML algorithms that only provide point estimates for the mean.
I have a question, or thought, about capturing uncertainty for binary classification. I would be interested in estimating the uncertainty in the model score in a binary classification, as a consequence of the amount of data that the algorithm has seen during training. If in a certain part of the feature space there was one positive and one negative training instance, the point estimate will be p=0.5, but with a lot less certainty than if there would have been a hundred positive and a hundred negative instances. One way to estimate the uncertainty in the binomial proportion p for a given number of negative (a) and positive (b) observations is to use a Beta distribution with parameters α = a + 1/2 and β = b + 1/2. This is referred to as Jeffreys interval in this paper, for example:
https://projecteuclid.org/journals/statistical-science/volume-16/issue-2/Interval-Estimation-for-a-Binomial-Proportion/10.1214/ss/1009213286.full
I wonder if the methods of XGBoostLSS can be used, or extended, to return the α, β parameters of a Beta distribution that fits the binary data observed during training. This would be a very valuable estimate of uncertainty. I see that the Beta distribution is implemented in XGBoostLSS, but if I understand correctly, this is to fit a regression on target variables that themselves follow the Beta distribution (not for classification of binomial target variables). And I'm not sure how to obtain the NLL for the Beta distribution for a binary target variable: the Beta PDF is 0 or ∞ at x=0 and x=1.
I have to admit that I am a bit out of my depth on the mathematics of this question, so if my reasoning is entirely in the wrong direction or incompatible with the idea behind XGBoostLSS, I am also happy to hear!
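The Jeffreys-interval arithmetic itself is easy to sketch with scipy (the function name and signature here are mine for illustration; whether XGBoostLSS can be extended to emit the per-region counts is the open question):

```python
from scipy.stats import beta

def jeffreys_interval(n_pos, n_neg, level=0.95):
    """Jeffreys credible interval for the positive-class probability:
    the posterior under the Jeffreys prior is Beta(n_pos + 1/2, n_neg + 1/2)."""
    a, b = n_pos + 0.5, n_neg + 0.5
    lo, hi = beta.ppf([(1 - level) / 2, (1 + level) / 2], a, b)
    return lo, hi

# Same point estimate (p = 0.5), very different uncertainty:
narrow = jeffreys_interval(100, 100)   # lots of data -> tight interval
wide = jeffreys_interval(1, 1)         # two observations -> wide interval
```

This matches the intuition in the question: the point estimate alone cannot distinguish the 1-vs-1 region of feature space from the 100-vs-100 region, while the Beta posterior width can.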
Hi Alexander, thanks for this incredible work. I really like this package; it's really helpful in situations where providing point forecast is not enough.
However, I'm curious about scaling to large datasets. Please pardon my ignorance. I've seen the examples in the notebook section, but the datasets used aren't large.
So, my question is: does the current implementation work in a similar fashion to the xgboost package? That is, can one leverage spark, dask, etc. out of the box with xgboostlss?
I'll appreciate it if you can shed some light on this.
Thanks.
First of all, thanks for the new pytorch version.
I've been using the previous versions; today I saw the new version and wanted to give it a try with the data and code that worked fine before.
After editing the old code for the differences in the new version, following the examples, I've noticed some problems with the distributions.
When I give my label data to the optimization or training, I get something along the lines of:
Expected value argument (Tensor of shape (8700,)) to be within the support (GreaterThan(lower_bound=0.0)) of the distribution LogNormal(), but found invalid values:
tensor([0.2782, 0.3064, 0.3202, ..., 0.3338, 0.3202, 0.3202])
or
Expected parameter scale (Tensor of shape ()) of distribution Normal(loc: -974.14453125, scale: -1210.6866455078125) to satisfy the constraint GreaterThan(lower_bound=0.0), but found invalid values:
-1210.6866455078125
This happens with every distribution that I've tried.
My dataset is between 0.09 and 0.9 in value; I've tried with similar datasets and got similar results. With one dataset I managed to run the model by multiplying the values by 10, but for other datasets that does not work.
All these datasets worked fine with the previous version. Do you know what might be the reason?