statmixedml / xgboostlss
An extension of XGBoost to probabilistic modelling
Home Page: https://statmixedml.github.io/XGBoostLSS/
License: Apache License 2.0
It would be nice to have access to the new multi-output feature exposed in xgboost 2.0
https://xgboost.readthedocs.io/en/stable/tutorials/multioutput.html#training-with-vector-leaf
Hi All,
I'm encountering dependency conflicts when trying to install XGBoostLSS using Poetry. The package currently specifies strict maximum versions for several dependencies, including torch, optuna, and scikit-learn. This causes conflicts with other common packages in my project that require more recent versions of these libraries.
Specifically:
Would it be possible to update or relax these version requirements? This could involve:
Here are some suggested edits for the setup.py file:
install_requires=[
    "xgboost>=2.0.0,<3.0.0",
    "torch>=2.1.0,<3.0.0",
    "pyro-ppl>=1.8.0,<2.0.0",
    "optuna>=3.0.0,<4.0.0",
    "scikit-learn>=1.0.0,<2.0.0",
    "numpy>=1.20.0,<2.0.0",
    "pandas>=1.0.0,<3.0.0",
    # ... (similar changes for other dependencies)
]
Would it be possible to support XGBoost 1.6 (or later)? If so, what would be the process for getting that out? (I could possibly take a look at upgrading it / raising a PR.)
Context is that we have some other dependencies that require XGBoost 1.6 or greater which clashes with the setup.py here.
Would it be possible to implement a 0 (and maybe even 0 and 1) adjusted dirichlet distribution, similar to:
Tsagris, M., & Stewart, C. (2018). A Dirichlet Regression Model for Compositional Data with Zeros. Lobachevskii Journal of Mathematics, 39(3), 398–412. doi:10.1134/s1995080218030198
I'm looking into the code for the xgboostlss class, and it seems the validation metric is hardcoded to negative log-likelihood. Will there be flexibility to choose the validation metric (e.g. MAE)?
As of now my tuning process is returning inf for each trial.
May I ask what the procedure is to install the package? I used install_github but it did not work.
Hello, I have read the paper and I have to say that it is some really great work. Is it possible to upload the code, as others have asked for it? You have 87 stars and 14 forks without uploading the code. You could help a lot of people.
First of all thank you for working on several boosting LSS versions (xgboost, catboost, lightgbm)!
I did notice that both xgb and lightgbm have (at first sight) the exact same distributions.py submodule, which is the torch core for any distribution supported as a boosting prediction.
Is there a specific reason not to share the distribution modules (and any other shared functionality, like plotting of distribution results) in a common Python module that then gets imported as a common dependency, to avoid duplication and to separate the development of adding distributions from adding functionality to the boosting part of the modules?
Does XGBoostLSS support categorical features and associated hyperparameters?
I'm getting an error when trying to conduct SHAP interpretations for a model containing categorical features:
[18:06:07] WARNING: /workspace/src/c_api/c_api.cc:1240: Saving into deprecated binary model format, please consider using `json` or `ubj`. Model format will default to JSON in XGBoost 2.2 if not specified.
<IPython.core.display.HTML object>
Error in py_call_impl(callable, call_args$unnamed, call_args$named) :
xgboost.core.XGBoostError: [18:06:07] /workspace/src/tree/tree_model.cc:899: Check failed: !HasCategoricalSplit(): Please use JSON/UBJSON for saving models with categorical splits.
Thanks for the great work!
I have two questions:
Can we perform multi-task learning in a single training where one task is classification and the target variables are categorical (classes) and the other task is regression where the target values are continuous?
Does XGBoostLSS models support ONNX conversions?
Thanks for this great library!
I was wondering if you plan to include censored loss functions as well, to allow for censored probabilistic regression. Similar to what XGB AFT models do, but with a non-constant scale parameter in the distribution models (https://xgboost.readthedocs.io/en/stable/tutorials/aft_survival_analysis.html).
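As a sketch of the idea, here is a right-censored Normal negative log-likelihood in numpy/scipy (the function name and the Normal choice are mine for illustration; an actual XGBoostLSS objective would need torch so that both location and scale receive gradients):

```python
import numpy as np
from scipy.stats import norm

def censored_normal_nll(y, censored, mu, sigma):
    """Negative log-likelihood for right-censored Normal data: observed
    points contribute the log-density, censored points the log-survival
    probability P(Y > y). Illustrative sketch, not the package's API."""
    nll_obs = -norm.logpdf(y, loc=mu, scale=sigma)
    nll_cens = -norm.logsf(y, loc=mu, scale=sigma)
    return np.where(censored, nll_cens, nll_obs).sum()
```

The AFT loss in xgboost fixes the scale globally; letting sigma vary per observation, as above, is exactly the LSS-style generalisation being requested.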
@uday1889: May I ask you to check out the readme and provide feedback?
Hi! First, I just wanted to say thank you so much for XGBoostLSS and LightGBMLSS, they're amazing packages and super useful :)
I wanted to ask, have you considered adding support for an EvoTrees backend? I think this would be incredibly helpful; EvoTrees.jl is the main package I use for regression trees. Thank you!
Hello! I've been exploring your package for a predictive modelling project, and I think I've found an issue? Either that or I've missed something important I need to do, but either way it's not working. Basically, whenever I try to train a model using a beta distribution as the output, I get log-likelihoods of 0 every time. I've put reproducible code below, using the sklearn diabetes regression toy dataset as an example (alright, it's not a beta distribution, but it should still return something non-zero...). The problem doesn't occur with the Gaussian distribution, and it occurs regardless of stabilisation method, and whether I'm using train, cv or hyper_opt, so it's not just a hyperparameter choice. During training you also at some point get a message saying either "Mean of empty slice" or "All-NaN slice encountered", so that probably points to the cause. I'm not entirely sure what's causing this, but it's probably an issue with the custom objective function? Could be the custom metrics, I suppose, but I think the objective function is more likely.
Anyway, if you could take a look at it and let me know if I'm doing something wrong, or if there is something weird going on under the hood, that would be great! Thanks very much.
from sklearn.datasets import load_diabetes
from xgboostlss.model import xgboostlss
from xgboostlss.distributions import Beta, Gaussian
from xgboost import DMatrix
X, y = load_diabetes(return_X_y=True, as_frame=True)
train_x = X[:-50]
train_y = y[:-50]
test_x = X[-50:]
test_y = y[-50:]
dtrain = DMatrix(train_x, train_y)
dtest = DMatrix(test_x, test_y)
beta = Beta
beta.stabilize = "L2"
single_trial_params = {
    "eta": 0.01,
    "max_depth": 4,
    "gamma": 1,
    "subsample": 0.6,
    "colsample_bytree": 0.7,
    "min_child_weight": 1,
}
eval_results = {}
model = xgboostlss.train(
    params=single_trial_params,
    dtrain=dtrain,
    dist=beta,
    evals=[(dtrain, 'train'), (dtest, 'test')],
    evals_result=eval_results,
)
Dear community,
I am currently working on a probabilistic extension of XGBoost that models all parameters of a distribution. This allows one to create probabilistic forecasts from which prediction intervals and quantiles of interest can be derived.
The problem is that XGBoost doesn't permit optimizing over several parameters. Assume we have a Normal distribution y ~ N(µ, sigma). So far, my approach is a two-step procedure, where I first optimize µ with sigma fixed, then optimize sigma with µ fixed, and then iterate between these two.
Since this is inefficient, is there any way of simultaneously optimizing both µ and sigma using a custom loss function?
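For anyone landing here: a common trick is to parameterize the scale as exp(log sigma) and hand the booster the stacked gradients of the joint NLL, one output (or set of trees) per parameter, so both are updated in the same boosting iteration. A minimal numpy sketch of the Normal-distribution gradients (illustrative, not the package's actual objective):

```python
import numpy as np

def normal_nll_grads(y, mu, log_sigma):
    """Element-wise gradients of the Normal negative log-likelihood
    with respect to mu and log(sigma). Optimizing log(sigma) instead
    of sigma keeps the scale positive without constraints."""
    sigma2 = np.exp(2.0 * log_sigma)
    grad_mu = -(y - mu) / sigma2                    # d NLL / d mu
    grad_log_sigma = 1.0 - (y - mu) ** 2 / sigma2   # d NLL / d log(sigma)
    return grad_mu, grad_log_sigma

# Sanity check: both gradients vanish (on average) at the sample MLE.
y = np.array([1.0, 2.0, 3.0, 4.0])
mu_hat = y.mean()
log_sigma_hat = 0.5 * np.log(((y - mu_hat) ** 2).mean())
g_mu, g_ls = normal_nll_grads(y, mu_hat, log_sigma_hat)
```

In a custom xgboost objective, the two gradient vectors (and the corresponding Hessians) would be interleaved so that each boosting round moves both parameters at once.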
Thanks for the great package!
Currently, when fitting data to a distribution only the NLLH loss is returned.
I would like to have some statistical test result as well, like the Kolmogorov–Smirnov test.
Any hint how I can get that?
Thanks!
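One hedged way to get this outside the package: take the fitted parameters and run scipy's KS test against the implied distribution. Here a single global Normal fit on synthetic data stands in for the model output, purely for illustration:

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for the training target; with XGBoostLSS one would
# instead plug in the predicted distributional parameters.
rng = np.random.default_rng(1)
y = rng.normal(loc=3.0, scale=2.0, size=500)
mu_hat, sigma_hat = y.mean(), y.std()

# One-sample KS test of the observations against the fitted Normal.
ks = stats.kstest(y, stats.norm(loc=mu_hat, scale=sigma_hat).cdf)
```

Note that testing against parameters estimated from the same data makes the standard KS p-value anti-conservative; for per-sample predicted distributions, a PIT histogram is a common alternative diagnostic.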
It would be great if XGBoostLSS can support Lambert W x F distributions; particularly useful are Lambert W x Gaussian distributions (Tukey's h is a special case of this for
In XGBoostLSS context I can see this being useful any time where normal regression might be too restrictive to give correct tail probability estimates (e.g., low sample size; financial data) and one can inspect
I'm not aware of a pytorch implementation of the Lambert W function, let alone Lambert W x F distributions. TensorFlow has both implemented; scipy.special.lambertw implements the Lambert W function.
If a pytorch implementation of the distribution is required to make this work in XGBoostLSS, then as an alternative, AFAICT this should be possible to accomplish using normalizing flows, with the heavy-tail Lambert W transformation as a specific normalizing flow function.
References
heavy-tail Lambert W x F distributions (Goerg, 2015)
LambertW R package: https://github.com/gmgeorg/LambertW
TensorFlow probability implementation of Lambert W bijectors and Lambert W x Gaussian Distribution
gaussianization layers based on Lambert W x F transformations/distributions: https://openreview.net/forum?id=OXP9Ns0gnIq
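The heavy-tail transformation itself does round-trip through scipy.special.lambertw; a sketch of the forward/inverse pair from the heavy-tail Lambert W x F construction (Goerg, 2015), with illustrative parameter values:

```python
import numpy as np
from scipy.special import lambertw

def heavy_tail_transform(u, delta):
    """Forward heavy-tail Lambert W transform: z = u * exp(delta/2 * u**2).
    delta > 0 fattens the tails of the latent variable u."""
    return u * np.exp(0.5 * delta * u ** 2)

def inverse_transform(z, delta):
    """Back-transform via the principal branch of the Lambert W function:
    u = sign(z) * sqrt(W(delta * z**2) / delta)."""
    w = lambertw(delta * z ** 2).real  # principal branch is real for delta*z**2 >= 0
    return np.sign(z) * np.sqrt(w / delta)

u = np.array([-2.0, -0.5, 0.5, 2.0])
z = heavy_tail_transform(u, delta=0.3)
u_back = inverse_transform(z, delta=0.3)  # recovers u
```

A pytorch version would only need an autograd-friendly Lambert W (e.g. a few Halley iterations), which is the missing piece noted above.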
Hi, this is a great library and I would like to tweak the source code for my own work. Do you plan to share your source code?
It would be great to integrate this package, and adjacent ones like LightGBMLSS, with skpro, which in turn directly integrates with sktime for time series forecasting (both of course integrate seamlessly with sklearn).
Issue opened here: sktime/skpro#184
This is very similar to the suggestion of @joshdunnlime for an sklearn interface; skpro already provides interface specifications and stringent tests (no need to write new ones!) for probabilistic tabular regressors.
What would be needed is, as far as I see it:
- a predict_proba interface
- XGBoostLSS implemented as skpro tabular distributions

Architecturally, there are two options:
1. distributions and estimators stay in XGBoostLSS, and the interfacing work is done in skpro
2. use check_estimator from skpro (works on distribution objects as well as on estimators) to create fully skpro conformant interfaces within XGBoostLSS, then have a light import wrapper in skpro

skpro already has an adapter to tensorflow for distributions.
Personally, I would think option 1 is preferable at least for the distributions, since the different distribution types are of general use, including for statmixedML's other packages, so it would avoid duplication of distribution objects or interfaces.
At a quick glance, it seems that the current setup.py file is fully exhaustive on all dependencies as absolute requirements, including very specific version ranges. If at all possible, it would greatly improve compatibility with existing Python repos (and let more people use it without having to resolve conflicts) if install_requires only specified the absolutely required modules (e.g., plotting or optuna is not really required to use this great package) with the minimum version needed (>=), instead of the (approximate) ~= range.
See also the first accepted answer here:
https://stackoverflow.com/questions/6947988/when-to-use-pip-requirements-file-versus-install-requires-in-setup-py
Curious to learn more about whether this package has to be so specific/restrictive on the dependencies (e.g., a suggestion is to use optional dependencies:
https://stackoverflow.com/questions/10572603/specifying-optional-dependencies-in-pypi-python-setup-py)
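A hedged sketch of what a slimmed-down setup.py could look like: floors only for hard runtime requirements, optional features behind extras. The grouping and the minimum versions shown are illustrative, not a tested constraint set:

```python
# setup.py fragment: hard runtime requirements, minimum versions only.
install_requires = [
    "xgboost>=1.6",
    "torch>=1.13",
    "numpy>=1.20",
    "pandas>=1.0",
]

# Optional features move to extras, installed on demand,
# e.g. `pip install xgboostlss[tuning]`.
extras_require = {
    "tuning": ["optuna>=3.0"],
    "plotting": ["matplotlib>=3.5", "seaborn>=0.12"],
}
```

These two dicts would be passed to setuptools' setup() call in place of the current fully pinned list.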
I've opened this issue to pick up our earlier conversation on expectile crossing after the 0.2.0 update.
I reinstalled XGBoostLSS and re-ran the Expectile Modelling ipynb that I shared earlier at this link : https://github.com/maxfield-green/XGBoostLSS/blob/master/examples/simulation_example_Expectile_v2.ipynb
I'm still observing the recentering and expectile crossing behavior that I initially cited.
Is it possible to do point prediction using XGBoostLSS? If it can only predict a prediction interval, then after uncertainty quantification, how can we know whether the predictions are improving? Most importantly, along with uncertainty quantification we need point predictions of the parameters in our research. Any help regarding this problem would be very helpful.
For distribution, it would be very helpful to publish to PyPi. A likely pre-requisite is tagging versions (see #18).
Let me know if you would like any assistance with this.
Hi there,
with the recent update on the model.py I start to get an error:
'DistributionClass' object has no attribute 'loss_fn'
--> 372 pruning_callback = optuna.integration.XGBoostPruningCallback(trial, f"test-{self.dist.loss_fn}")
374 xgblss_param_tuning = self.cv(params=hyper_params,
375 dtrain=dtrain,
376 num_boost_round=num_boost_round,
(...)
381 verbose_eval=False
382 )
384 opt_rounds = xgblss_param_tuning[f"test-{self.dist.loss_fn}-mean"].idxmin() + 1
Any idea how to fix it?
Hi,
your package & paper look really promising. I can't wait to test drive it.
The readme mentions a Julia implementation is planned, that would be amazing.
May I suggest, consider wrapping it in the interface MLJ.jl which already has several boosting & probabilistic forecasting options with more on the way...
The original shap package is currently not maintained. Hence, in its current implementation, shap is not compatible with numpy>=1.24.0. For details see the following issue shap/shap#2911.
Hence, XGBoostLSS currently relies on https://github.com/dsgibbons/shap.git. For this package to be properly installed, please avoid installing xgboostlss in a directory/path or conda/venv environment that contains "xgboost/xgboostlss" or any other xgboost-related name. Otherwise, dsgibbons/shap won't turn off CUDA building in its setup() call and xgboostlss will likely not install properly. See the following issue dsgibbons/shap#50.
Dear colleagues,
According to the docs:
The Gamma distribution is supported, but not the Tweedie family, of which Gamma is a boundary case.
Alternative packages already implement this as for instance:
https://github.com/CDonnerer/xgboost-distribution/
BR
E
Hello, may I ask a question regarding the Plot of Actual vs. Predicted Quantiles part in this notebook: https://github.com/StatMixedML/XGBoostLSS/blob/master/examples/simulation_example_Gaussian.ipynb
How could we choose values for loc, scale, etc.? Thanks.
norm.ppf(quant_sel[0], loc = 10, scale = 1 + 4*((0.3 < test["x"].values) & (test["x"].values < 0.5)) + 2*(test["x"].values > 0.7))
Hi,
Thank you for creating XGBoostLSS.
I have a question, When I set the distribution to "NBI" (Negative Binomial) and make predictions for some samples, I see that the scale parameter is always the same for all samples and it is greater than zero (for example, 88). I suppose that for negative binomial, scale is equal to 1/n and because n>0 it should be less than 1 for most of the time.
Do I have any misunderstanding on this?
I'm following the examples. Everything else works correctly but for some strange reason, when I get to the cell where it makes predictions, I always get the output for pred_type="parameters". I would greatly appreciate any help on this matter.
Thank you!
When I try to use the BCT distribution on a dataset, the resulting prediction parameters (location, nu, tau) are all constant. I believe nu and tau are stuck at their initial values (0.5, 10). Only scale changes, as in the Gaussian example. I am wondering what this means?
The parameters I used for training are:
Best trial:
  Value: 125.35836839999999
  Params:
    eta: 0.040946512023655214
    max_depth: 4
    gamma: 2.9779429635527073e-05
    subsample: 0.2887880027234064
    colsample_bytree: 0.3208686624698766
    min_child_weight: 316
    opt_rounds: 500
Hi, I'm trying to train the XGBLSS on the m5 dataset but I keep on getting out of MemoryError with the hyper_opt method. Is there a parameter to reduce the amount of memory used for this method?
With active development, it may be helpful to bump the version after significant changes, and tag them. This will also help in publishing to pypi.
Let me know if you would like any assistance with this.
Dear Alexander,
We are now comparing MAPIE confidence intervals to a custom methodology we have created based on the XGBoostLSS.
Basically, we are using the expectiles that XGBoostLSS provides to fit a CDF that approximates a Tweedie CDF (this is a bit hacky, but we are getting theoretical coverage).
The problem is that we split this procedure into two parts and use joblib to pass the serialized XGBoostLSS model to the CDF approximation process.
I have not seen any example of how to serialize XGBoostLSS objects, but when trying with joblib:
X_train_ci = processor_reg.transform(
    regression_df.drop(columns=[arguments.clf_targets] + [arguments.reg_targets])
)
y_train_ci = regression_df[arguments.reg_targets]
n_cpu = multiprocessing.cpu_count()
dtrain = xgb.DMatrix(X_train_ci, label=y_train_ci, nthread=n_cpu)
logging.info("Training XGBoost")
xgboostlss_model_expectile = xgboostlss.train(
    hyperparameters,
    dtrain,
    dist=distribution_expectile,
    num_boost_round=hyperparameters["opt_rounds"],
    verbose_eval=True,
)
if partial:
    logging.info("Serializing partial fit")
    joblib.dump(xgboostlss_model_expectile, 'outputs/models/zip_conf_partial_fit.joblib')
else:
    logging.info("Serializing full fit")
    joblib.dump(xgboostlss_model_expectile, 'outputs/models/zip_conf.joblib')
We are getting the following error:
With distribution_expectile defined as:
distribution_expectile = Expectile
distribution_expectile.expectiles = arguments.exp_list
distribution_expectile.stabilize = "MAD"
Are we choosing the wrong serialization method? Do we need to save XGBoostLSS in a different way?
BR
Edgar
When I use joblib to load a model built by xgblss, I find that the reloaded model loses some information: it outputs different values from the original model even for the same input. The difference between the outputs is a constant across all inputs. Can you fix this? I think the model's attributes get lost when the model is reloaded.
By the way, is it possible to add quantile regression to the distribution and multitasking learning? I think they are quite useful.
Is there an appetite to add a scikit learn API for this? If so, very happy to help contrib. Also for LightGBMLSS.
Hi,
Thanks for creating the XGBoostLSS (and LightGBMLSS) package, this an important extension to ML algorithms that only provide point estimates for the mean.
I have a question, or thought, about capturing uncertainty for binary classification. I would be interested in estimating the uncertainty in the model score in a binary classification, as a consequence of the amount of data that the algorithm has seen during training. If in a certain part of the feature space there was one positive and one negative training instance, the point estimate will be p=0.5, but with a lot less certainty than if there would have been a hundred positive and a hundred negative instances. One way to estimate the uncertainty in the binomial proportion p for a given number of negative (a) and positive (b) observations is to use a Beta distribution with parameters α = a + 1/2 and β = b + 1/2. This is referred to as Jeffreys interval in this paper, for example:
https://projecteuclid.org/journals/statistical-science/volume-16/issue-2/Interval-Estimation-for-a-Binomial-Proportion/10.1214/ss/1009213286.full
I wonder if the methods of XGBoostLSS can be used, or extended, to return the α, β parameters of a Beta distribution that fits the binary data observed during training. This would be a very valuable estimate of uncertainty. I see that the Beta distribution is implemented in XGBoostLSS, but if I understand correctly, this is to fit a regression on target variables that themselves follow the Beta distribution (not for classification of binomial target variables). And I'm not sure how to obtain the NLL for the Beta distribution for a binary target variable: the Beta PDF is 0 or ∞ at x=0 and x=1.
I have to admit that I am a bit out of my depth on the mathematics of this question, so if my reasoning is entirely in the wrong direction or incompatible with the idea behind XGBoostLSS, I am also happy to hear!
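The Jeffreys-interval arithmetic itself is easy to sketch with scipy (the function name and signature here are mine for illustration; whether XGBoostLSS can be extended to emit the per-region counts is the open question):

```python
from scipy.stats import beta

def jeffreys_interval(n_pos, n_neg, level=0.95):
    """Jeffreys credible interval for the positive-class probability:
    the posterior under the Jeffreys prior is Beta(n_pos + 1/2, n_neg + 1/2)."""
    a, b = n_pos + 0.5, n_neg + 0.5
    lo, hi = beta.ppf([(1 - level) / 2, (1 + level) / 2], a, b)
    return lo, hi

# Same point estimate (p = 0.5), very different uncertainty:
narrow = jeffreys_interval(100, 100)   # lots of data -> tight interval
wide = jeffreys_interval(1, 1)         # two observations -> wide interval
```

This matches the intuition in the question: the point estimate alone cannot distinguish the 1-vs-1 region of feature space from the 100-vs-100 region, while the Beta posterior width can.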
Hi Alexander, thanks for this incredible work. I really like this package; it's really helpful in situations where providing point forecast is not enough.
However, I'm curious about scaling to large datasets. Please pardon my ignorance. I've seen the examples in the notebook section, but the datasets used aren't large.
So, my question is: does the current implementation work in a similar fashion to the xgboost package? That is, can one leverage spark, dask, etc. out of the box with xgboostlss?
I'll appreciate it if you can shed some light on this.
Thanks.
First of all, thanks for the new pytorch version.
I've been using the previous versions; today I saw the new version and wanted to give it a try with the data and code that worked fine before.
After editing the old code for the differences in the new version, following the examples, I've noticed some problems with the distributions.
When I give my label data to the optimization or training, I get something along the lines of:
Expected value argument (Tensor of shape (8700,)) to be within the support (GreaterThan(lower_bound=0.0)) of the distribution LogNormal(), but found invalid values:
tensor([0.2782, 0.3064, 0.3202, ..., 0.3338, 0.3202, 0.3202])
or
Expected parameter scale (Tensor of shape ()) of distribution Normal(loc: -974.14453125, scale: -1210.6866455078125) to satisfy the constraint GreaterThan(lower_bound=0.0), but found invalid values:
-1210.6866455078125
This happens with every distribution that I've tried.
My dataset is between 0.09 and 0.9 in value; I've tried with similar datasets and got similar results. With one dataset I managed to run the model by multiplying the values by 10, but for other datasets that does not work.
All these datasets worked fine with the previous version. Do you know what might be the reason?