zillow / quantile-forest

Quantile Regression Forests compatible with scikit-learn.

Home Page: https://zillow.github.io/quantile-forest/

License: Apache License 2.0

Python 72.01% Cython 23.78% TeX 4.21%

python machine-learning random-forest quantile-regression quantile-regression-forests scikit-learn-api uncertainty-estimation prediction-intervals

quantile-forest's Introduction

quantile-forest

quantile-forest offers a Python implementation of quantile regression forests compatible with scikit-learn.

Quantile regression forests (QRF) are a non-parametric, tree-based ensemble method for estimating conditional quantiles, with application to high-dimensional data and uncertainty estimation [1]. The estimators in this package are performant, Cython-optimized QRF implementations that extend the forest estimators available in scikit-learn to estimate conditional quantiles. The estimators can estimate arbitrary quantiles at prediction time without retraining and provide methods for out-of-bag estimation, calculating quantile ranks, and computing proximity counts. They are compatible with and can serve as drop-in replacements for the scikit-learn variants.

Example of fitted model predictions and prediction intervals on California housing data (code)

Quick Start

Install quantile-forest from PyPI using pip:

pip install quantile-forest

Usage

from quantile_forest import RandomForestQuantileRegressor
from sklearn import datasets
X, y = datasets.fetch_california_housing(return_X_y=True)
qrf = RandomForestQuantileRegressor()
qrf.fit(X, y)
y_pred = qrf.predict(X, quantiles=[0.025, 0.5, 0.975])
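
With three quantiles requested, each row of y_pred holds a lower bound, a median, and an upper bound for one sample. As a minimal sketch of how such rows might be summarized, the helper below (interval_summary is a hypothetical name, not part of quantile-forest) computes the fraction of targets that fall inside their interval and the mean interval width. It uses plain Python lists with made-up numbers so that it runs without the package installed; with real output, y.tolist() and y_pred.tolist() could be passed instead.

```python
def interval_summary(y_true, y_pred_rows):
    """Return (coverage, mean_width) for rows of [lower, median, upper]."""
    if len(y_true) != len(y_pred_rows) or not y_true:
        raise ValueError("y_true and y_pred_rows must be non-empty and equally long")
    covered = 0
    total_width = 0.0
    for target, (lower, _median, upper) in zip(y_true, y_pred_rows):
        # Count the target as covered when it lies inside the interval.
        if lower <= target <= upper:
            covered += 1
        total_width += upper - lower
    n = len(y_true)
    return covered / n, total_width / n


# Toy values for illustration only (not California housing results).
y_true = [1.0, 2.5, 4.0, 0.5]
y_pred_rows = [
    [0.5, 1.1, 2.0],
    [1.0, 2.0, 3.0],
    [2.0, 3.0, 3.5],
    [0.0, 0.4, 1.0],
]
coverage, mean_width = interval_summary(y_true, y_pred_rows)
print(f"coverage={coverage:.2f}, mean width={mean_width:.2f}")
```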

Documentation

An installation guide, API documentation, and examples can be found in the documentation.

References

[1] N. Meinshausen, "Quantile Regression Forests", Journal of Machine Learning Research, 7(Jun), 983-999, 2006. http://www.jmlr.org/papers/volume7/meinshausen06a/meinshausen06a.pdf

Citation

If you use this package in academic work, please consider citing https://joss.theoj.org/papers/10.21105/joss.05976:

@article{Johnson2024,
    doi = {10.21105/joss.05976},
    url = {https://doi.org/10.21105/joss.05976},
    year = {2024},
    publisher = {The Open Journal},
    volume = {9},
    number = {93},
    pages = {5976},
    author = {Reid A. Johnson},
    title = {quantile-forest: A Python Package for Quantile Regression Forests},
    journal = {Journal of Open Source Software}
}

quantile-forest's Issues

Error when importing RandomForestQuantileRegressor on windows python 3.9.9

Platform:

Windows x64

Python version: 3.9.9

Traceback:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "...\venv\lib\site-packages\quantile_forest\__init__.py", line 3, in <module>
    from ._quantile_forest import ExtraTreesQuantileRegressor
  File "...\venv\lib\site-packages\quantile_forest\_quantile_forest.py", line 42, in <module>
    from ._quantile_forest_fast import QuantileForest
  File "quantile_forest\\_quantile_forest_fast.pyx", line 1, in init quantile_forest._quantile_forest_fast
ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject

How to Replicate

pip install quantile-forest
python
>>> from quantile_forest import RandomForestQuantileRegressor

and it will raise the error.

Also, would you mind providing a build for Python 3.10 on Windows?

numpy error when using latest 1.1.0 release

Hi, first of all thanks for the great package and support.

I had been using version 1.0.2 by manually building and installing the package, since I have an Apple silicon Mac, and everything worked as intended. I updated to the latest version, 1.1.0, since you added support for M1/M2 Macs, but now I am getting this error:

    from ._quantile_forest_fast import QuantileForest
  File "quantile_forest/_quantile_forest_fast.pyx", line 17, in init quantile_forest._quantile_forest_fast
  File "__init__.cython-30.pxd", line 986, in numpy.import_array
ImportError: numpy.core.multiarray failed to import

I have numpy 1.22.4. I see that you didn't specify a numpy version in the requirements, but I was wondering whether a specific version is required, or whether you have an idea of why this could be happening.

Thank you very much.
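
Both import errors above (the ndarray size mismatch and the failed numpy.core.multiarray import) often indicate that a compiled extension was built against a different NumPy than the one installed. A first diagnostic step is to record the exact versions present in the environment. The sketch below uses only the standard library; installed_versions is a hypothetical helper name, and the package list is an assumption about what is relevant here.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


report = installed_versions(["numpy", "scikit-learn", "quantile-forest"])
for name, version in report.items():
    print(f"{name}: {version or 'not installed'}")
```

Including this output in a bug report makes it easier to tell whether the installed wheel and the local NumPy are compatible.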

RandomForestQuantileRegressor Multi-output

Hello author,

First of all, thanks and congratulations on your magnificent work. I was trying to compute prediction intervals with RandomForestQuantileRegressor using a multi-output approach (t+1 -> t+n). I checked, and in the code this seems to be marked as #ToDo. I was wondering whether you are planning an implementation soon, since it would be really useful.

Look forward to your answer,
Thank you
Rafael

Using RandomForestQuantileRegressor model for uncertainty quantification

I can see how this package can be used to generate predictions for the mean, median, or any given quantile. I am interested in using quantile random forests to get an empirical distribution of predicted values that represent the model uncertainty for a new sample. This can be done in R's ranger by taking a large enough sample, with replacement, using the what argument during the predict call:

pred <- predict(rf, mtcars[27:32, ], type = "quantiles", what = function(x) sample(x, 1000, replace = TRUE))

This provides an empirical sample from the prediction distribution. Then, if you want a single number representing the uncertainty of your prediction, you can take the standard deviation of these values.

I saw a similar issue here where someone was asking how to do something like this but I wasn't 100% sure that was the same thing that I am looking for. Thank you for any clarification you can provide.
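
The ranger call above draws 1000 values with replacement from the training responses associated with each new sample. A minimal sketch of that resampling step in plain Python follows. It assumes the relevant training responses are already available as a list (leaf_values, filled here with made-up numbers); how to extract them from a fitted quantile-forest model is not shown, and sample_leaf_values is a hypothetical helper name.

```python
import random
import statistics


def sample_leaf_values(leaf_values, n_draws=1000, seed=None):
    """Draw n_draws values with replacement from the given training responses."""
    rng = random.Random(seed)
    return rng.choices(leaf_values, k=n_draws)


# Made-up responses standing in for the values that share leaves with one new sample.
leaf_values = [21.0, 22.8, 19.7, 30.4, 15.8, 21.4]
draws = sample_leaf_values(leaf_values, n_draws=1000, seed=0)

# A single-number summary of the spread of the empirical distribution.
spread = statistics.stdev(draws)
print(f"standard deviation of draws: {spread:.3f}")
```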

Use the quantile_forest to predict probabilities and other statistics

Hello author, thank you for your excellent work!

For Quantile Regression Forests, I have been using the R package "quantregForest" developed by Nicolai Meinshausen (http://github.com/lorismichel/quantregForest), but the performance of R is clearly inferior to that of Python, so when I came across quantile-forest, I was delighted. I need your help with three questions.

First, is Quantile Ranks used to calculate probabilities? In quantregForest, you can achieve probability prediction by setting the "what=ecdf" parameter:

predict(object, newdata = NULL, what = , ...)

Nicolai Meinshausen explains the what parameter: Can be a vector of quantiles or a function. Default for what is a vector of quantiles (with numerical values in [0,1]) for which the conditional quantile estimates should be returned.
If a function, it has to take as argument a numeric vector and return either a summary statistic (such as mean, median or sd to get conditional mean, median or standard deviation) or a vector of values (such as with quantiles or via sample) or a function (for example with ecdf).

Second, can quantile-forest be used to calculate other statistics? For example, kurtosis, skewness, variance, etc.

Third, can the weights of the training samples that correspond to a prediction sample be obtained? If so, the first two questions would be solved.

I hope to get your feedback. Wishing you a happy life!
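
If per-sample training weights for a prediction were available (the third question), the first two questions reduce to weighted statistics over the training responses. The sketch below illustrates this with plain Python. The values and weights lists are made-up inputs, and weighted_ecdf and weighted_moments are hypothetical helper names rather than functions of quantile-forest or quantregForest.

```python
import math


def weighted_ecdf(values, weights, threshold):
    """Weighted estimate of P(Y <= threshold)."""
    total = sum(weights)
    return sum(w for v, w in zip(values, weights) if v <= threshold) / total


def weighted_moments(values, weights):
    """Weighted mean, variance, skewness, and (non-excess) kurtosis."""
    total = sum(weights)
    mean = sum(w * v for v, w in zip(values, weights)) / total
    variance = sum(w * (v - mean) ** 2 for v, w in zip(values, weights)) / total
    if variance == 0:
        # Standardized moments are undefined for a degenerate distribution.
        return {"mean": mean, "variance": 0.0,
                "skewness": float("nan"), "kurtosis": float("nan")}
    std = math.sqrt(variance)
    skewness = sum(w * ((v - mean) / std) ** 3 for v, w in zip(values, weights)) / total
    kurtosis = sum(w * ((v - mean) / std) ** 4 for v, w in zip(values, weights)) / total
    return {"mean": mean, "variance": variance,
            "skewness": skewness, "kurtosis": kurtosis}


# Toy training responses and weights for a single prediction sample.
values = [1.0, 2.0, 3.0, 4.0]
weights = [0.25, 0.25, 0.25, 0.25]
prob = weighted_ecdf(values, weights, threshold=2.5)
stats = weighted_moments(values, weights)
print(prob, stats)
```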

quantile-forest not installable via poetry install on M1/M2 chip Macbooks

Having quantile-forest as dependency in a pyproject.toml and then installing via poetry install does not seem to work:

  SolverProblemError

  Because <project name> depends on quantile-forest (1.0.2) which doesn't match any versions, version solving failed.

  at ~/.poetry/lib/poetry/puzzle/solver.py:241 in _solve
      237│             packages = result.packages
      238│         except OverrideNeeded as e:
      239│             return self.solve_in_compatibility_mode(e.overrides, use_latest=use_latest)
      240│         except SolveFailure as e:
    → 241│             raise SolverProblemError(e)
      242│
      243│         results = dict(
      244│             depth_first_search(
      245│                 PackageNode(self._package, packages), aggregate_package_nodes

This only seems to happen on the ARM-based MacBooks (M1/M2) and not on the older Intel-based versions. Installing from source does work on the ARM-based systems.

Presumably the right wheels for those versions are not available on PyPI: https://pypi.org/project/quantile-forest/#files

Would it be possible to add the appropriate wheels for those versions, or maybe a source distribution that will install on the M1/M2 chips? I'd love to hear if there are other solutions as well.

Thanks!

Unable to instantiate without positional arguments

I'm attempting to use skops to save and load the model, but I'm running into an issue (skops-dev/skops#383) with loading, because two positional arguments are required when the model is instantiated using the __new__ method. Calling the method on the object leads to the __cinit__ method, defined at line 568 of quantile-forest/quantile_forest/_quantile_forest_fast.pyx (commit efc1ebd), which requires two positional arguments.

Would it be possible to make these optional or have some default value for instantiation?
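
The problem can be reproduced with a pure-Python analogue. In Cython, the arguments passed at construction are forwarded to __cinit__, so a loader that creates an empty instance with cls.__new__(cls) fails when those arguments are required. In the sketch below, StrictNode and LenientNode are hypothetical stand-ins, not the package's classes; the second shows how default values let argument-free instantiation succeed.

```python
class StrictNode:
    """Stand-in for an extension type whose constructor needs two arguments."""

    def __new__(cls, n_rows, n_cols):
        self = super().__new__(cls)
        self.shape = (n_rows, n_cols)
        return self


class LenientNode:
    """Same idea, but both arguments have defaults."""

    def __new__(cls, n_rows=0, n_cols=0):
        self = super().__new__(cls)
        self.shape = (n_rows, n_cols)
        return self


# A loader that builds an empty object before restoring state calls __new__ bare.
try:
    StrictNode.__new__(StrictNode)
    strict_ok = True
except TypeError:
    strict_ok = False

lenient = LenientNode.__new__(LenientNode)
print(strict_ok, lenient.shape)
```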

Unable to successfully run tests using Python 3.8

I'm attempting to test this project on all supported Python versions, which appears to include Python 3.8 to 3.12. However, I'm getting errors testing on Python 3.8. I believe that this is related to type hinting syntax that is not supported in 3.8.

Here's the error that I'm seeing:

quantile_forest/tests/test_quantile_forest.py:44: in <module>
FOREST_REGRESSORS: dict[str, Any] = {
E   TypeError: 'type' object is not subscriptable

If it's helpful, I'm seeing this error during an automated testing run in a fork of this project. The exact error message and environment I'm using can be seen in the Actions log.

Is Python 3.8 still supported? Are the tests expected to run correctly for that version?

This issue is related to the JOSS review of this project.
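
Subscripting the built-in dict in an annotation (dict[str, Any]) is only supported from Python 3.9 onward, which matches the TypeError above. One way to keep such an annotation working on 3.8 is to use typing.Dict, as sketched below; the dictionary entry is a placeholder, not the project's real registry. Adding from __future__ import annotations at the top of the module may be an alternative, since it defers evaluation of annotations.

```python
from typing import Any, Dict

# typing.Dict is subscriptable on Python 3.8, unlike the built-in dict.
FOREST_REGRESSORS: Dict[str, Any] = {
    "example": object,  # placeholder entry for illustration only
}

print(sorted(FOREST_REGRESSORS))
```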

version.txt missing in PyPI's source distribution

Hello, first of all, thanks for this fast and good implementation of quantile random forest!
I'm using this package on CentOS on arm64, and there is no built distribution on PyPI. When I execute 'pip install quantile-forest', the build process throws an error like this:

(spark) E:\quantile-forest-1.1.2>python setup.py install
Partial import of quantile-forest during the build process.
Traceback (most recent call last):
  File "E:\quantile-forest-1.1.2\setup.py", line 28, in <module>
    version = write_version_py()
  File "E:\quantile-forest-1.1.2\setup.py", line 19, in write_version_py
    with open(os.path.join("quantile_forest", "version.txt")) as f:
FileNotFoundError: [Errno 2] No such file or directory: 'quantile_forest\version.txt'

I think that's because the file 'version.txt' in the 'quantile-forest' folder is changed to 'version.py' in PyPI's source distribution. I'd appreciate it if you could fix the problem, since I need to install this library directly from PyPI. Thank you in advance!

Impossible to use quantile-forest against the dev version of scikit-learn

Here is a typical traceback one would get when importing quantile_forest when scikit-learn is 1.3.0dev0 (the main branch at the time of writing).

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[56], line 2
      1 from sklearn.ensemble import RandomForestRegressor
----> 2 from quantile_forest import RandomForestQuantileRegressor
      5 min_samples_leaf = 300
      6 rf = RandomForestRegressor(criterion="poisson", min_samples_leaf=min_samples_leaf)

File ~/mambaforge/envs/dev/lib/python3.11/site-packages/quantile_forest/__init__.py:3
      1 import os
----> 3 from ._quantile_forest import ExtraTreesQuantileRegressor
      4 from ._quantile_forest import RandomForestQuantileRegressor
      6 try:

File ~/mambaforge/envs/dev/lib/python3.11/site-packages/quantile_forest/_quantile_forest.py:46
     43 from ._quantile_forest_fast import QuantileForest
     44 from ._quantile_forest_fast import generate_unsampled_indices
---> 46 sklearn_version = tuple(map(int, (sklearn.__version__.split('.'))))
     49 def _generate_unsampled_indices(sample_indices, duplicates=None):
     50     """Private function used by forest._get_unsampled_indices function."""

ValueError: invalid literal for int() with base 10: 'dev0'

Ideally this code should support any PEP 440 version format:

https://peps.python.org/pep-0440/#public-version-identifiers
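
The failure comes from calling int() on every dot-separated component, which breaks on suffixes such as dev0 or rc1. A minimal sketch of a more tolerant approach is to extract only the leading numeric release segment. The release_tuple helper below is a hypothetical name and covers only the release part of PEP 440, not full version ordering; the third-party packaging library (packaging.version.Version and its release attribute) is a more complete option when it is available.

```python
import re


def release_tuple(version):
    """Return the leading numeric release segment of a version string as ints."""
    # Optional "v" prefix and epoch ("N!"), then dot-separated integers.
    match = re.match(r"\s*v?(?:\d+!)?(\d+(?:\.\d+)*)", version)
    if match is None:
        raise ValueError(f"no numeric release segment in {version!r}")
    return tuple(int(part) for part in match.group(1).split("."))


for candidate in ["1.2.2", "1.3.dev0", "1.3.0dev0", "1.3.0rc1"]:
    print(candidate, release_tuple(candidate))
```

Comparing the resulting tuples (for example release_tuple(v) >= (1, 3)) then works for development and pre-release builds as well.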

Can you show me how the quantile is designed in this model?
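
In the method described in reference [1], each training sample receives a weight for a given query point: for every tree, the samples that share the query's leaf each get 1 / (leaf size), and these are averaged over trees. The conditional quantile is then the smallest training response at which the weighted cumulative distribution reaches the requested level. The sketch below illustrates that idea on toy data in plain Python. The function names and inputs are hypothetical, and the package's optimized implementation may differ in details such as interpolation.

```python
def qrf_weights(leaf_members_per_tree, n_train):
    """Average over trees of 1/|leaf| for training samples in the query's leaf."""
    weights = [0.0] * n_train
    n_trees = len(leaf_members_per_tree)
    for members in leaf_members_per_tree:
        for i in members:
            weights[i] += 1.0 / (len(members) * n_trees)
    return weights


def weighted_quantile(y_train, weights, q):
    """Smallest response whose weighted cumulative distribution reaches q."""
    pairs = sorted(zip(y_train, weights))
    cumulative = 0.0
    for y, w in pairs:
        cumulative += w
        if cumulative >= q - 1e-12:  # small tolerance for float rounding
            return y
    return pairs[-1][0]


# Toy setup: five training responses and two trees.
y_train = [10.0, 20.0, 30.0, 40.0, 50.0]
# Indices of the training samples that share the query's leaf in each tree.
leaf_members_per_tree = [[0, 1], [1, 2, 3]]

weights = qrf_weights(leaf_members_per_tree, n_train=len(y_train))
median = weighted_quantile(y_train, weights, 0.5)
upper = weighted_quantile(y_train, weights, 0.9)
print(weights, median, upper)
```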

Very large model size

First of all thank you for this nice and fast QRF implementation!

We're experimenting with the package, but when training a model with roughly 77_000 samples, 600 estimators and a max_depth of 16, the final artifact size (saved with either joblib or pickle) is roughly 35 GB. When we use more estimators or training samples, or increase the max_depth, the size naturally increases even more.

I understand the model has to save the target values in each leaf node, but even then 35GB+ seems excessive. Is something going wrong on our end or is this size expected? Thanks in advance!
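
A back-of-envelope calculation can help judge whether a size is plausible. The sketch below compares two storage layouts, both of which are assumptions for illustration and not confirmed descriptions of how the package stores its data: a compact layout with one 8-byte index per training sample per tree, and a dense layout in which every possible leaf is padded to the same number of slots. The max_leaf_size value of 100 is hypothetical.

```python
def index_storage_bytes(n_samples, n_estimators, bytes_per_index=8):
    """Bytes needed to store one index per training sample per tree."""
    return n_samples * n_estimators * bytes_per_index


def padded_storage_bytes(n_estimators, max_depth, max_leaf_size, bytes_per_index=8):
    """Bytes for a dense (trees x leaves x slots) array with equally padded leaves."""
    max_leaves = 2 ** max_depth  # upper bound on leaves for a tree of this depth
    return n_estimators * max_leaves * max_leaf_size * bytes_per_index


compact = index_storage_bytes(n_samples=77_000, n_estimators=600)
padded = padded_storage_bytes(n_estimators=600, max_depth=16, max_leaf_size=100)
print(f"compact: {compact / 1e9:.2f} GB, padded: {padded / 1e9:.2f} GB")
```

Under these assumptions the compact layout stays well below 1 GB, while a padded layout can reach tens of GB, so the storage layout matters far more than the raw sample count.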

Fractional 'max_samples' Results in Error

Discussed in #45

Originally posted by fcascao10 April 10, 2024
It seems that the parameter 'max_samples' is not working correctly. Currently, the bootstrap sample can only have the same length as the total dataset. For instance, using a float value other than 1.0 (e.g. 0.5) raises the error: "ValueError: could not broadcast input array from shape ('# total obs * 0.5',) into shape ('# total obs',)".
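
The error message suggests that a buffer sized for the full dataset is being filled with a smaller bootstrap sample. As a minimal sketch of the intended behavior (not the package's code), the hypothetical helper below sizes the bootstrap from max_samples and derives the unsampled (out-of-bag) indices from what was actually drawn, so the two stay consistent for any fraction.

```python
import random


def bootstrap_indices(n_samples, max_samples=None, seed=None):
    """Draw bootstrap indices and the matching unsampled (out-of-bag) indices."""
    if max_samples is None:
        n_draws = n_samples
    elif isinstance(max_samples, float):
        if not 0.0 < max_samples <= 1.0:
            raise ValueError("a float max_samples must be in (0.0, 1.0]")
        # A fraction of the dataset, with at least one draw.
        n_draws = max(round(n_samples * max_samples), 1)
    else:
        n_draws = int(max_samples)

    rng = random.Random(seed)
    sampled = [rng.randrange(n_samples) for _ in range(n_draws)]
    drawn = set(sampled)
    unsampled = [i for i in range(n_samples) if i not in drawn]
    return sampled, unsampled


sampled, unsampled = bootstrap_indices(n_samples=10, max_samples=0.5, seed=0)
print(len(sampled), len(unsampled))
```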