
Comments (5)

brdeleeuw commented on June 3, 2024

We managed to reduce file size substantially by (for now) taking a 2-step approach:

  • Tune hyperparameters with max_leaf_nodes on default value (None)
  • Fix these best hyperparameters and then downsample max_samples_leaf

We tested values between 1 and 100; even 1 showed quite similar performance. max_samples_leaf=10 gave us the best trade-off between performance and file size: the model went from 32 GB to around 250 MB while performance stayed roughly the same.

We'll consider fixing max_samples_leaf=10 for the next hyperparameter tuning iteration so we don't need the 2-step approach, though we suspect it will change the underlying tree structure, increase file size again, and therefore require additional tuning (or fixing) of max_leaf_nodes.

For now this should be sufficient, thanks again for the help!

from quantile-forest.

reidjohnson commented on June 3, 2024

Thanks for using the package!

The model sizes can get quite large, but there are a few levers to pull.

Given your dataset size and number of estimators, it's likely that the internal model data structure is very sparse. During model fitting, you can set the keyword argument sparse_pickle to True -- that is, model.fit(X_train, y_train, sparse_pickle=True). This converts the model data structure to a sparse matrix during serialization (i.e., with joblib or pickle); it's a lossless compression and requires no other code changes.

There are also some hyperparameters that can be tuned to reduce the model size with minimal impact on model accuracy. In addition to the standard RandomForestRegressor hyperparameters such as max_leaf_nodes, there's also max_samples_leaf, which, if set, randomly downsamples each leaf node to a maximum number of retained values. In practice, this sort of sampling can work well with QRF models that have a large number of leaf nodes.

Between these approaches, it should usually be possible to reduce the artifact size by several orders of magnitude with little or no impact on the accuracy of the model.


brdeleeuw commented on June 3, 2024

Thanks for getting back to us so quickly!

The max_samples_leaf parameter makes a lot of sense now. We initially decided not to touch it, figuring that retaining more samples would likely lead to better quantile estimation, but we overlooked that this parameter can make the model behave more like the R implementation of QRF, which randomly keeps one sample per leaf node. Great to have this flexibility built in!

And as for the sparse_pickle parameter, we completely missed that one. We'll try both approaches and report back any improvements or additional questions, thanks again!


brdeleeuw commented on June 3, 2024

Short update: the sparse_pickle parameter helped tremendously in reducing file size on disk (35 GB -> 700 MB), but the memory consumption when loading the model still seems to be the same. Is this expected, similar to what we'd see with joblib or pickle compression?

We'll now look for a lower max_samples_leaf value that shows minimal performance loss while still being low enough to substantially reduce file size.


reidjohnson commented on June 3, 2024

Thanks for sharing the update.

The memory consumption is in line with expectations. The sparse matrix is only used during model serialization/deserialization; converting it back to a dense matrix on load would (sadly) be expected to incur the same memory overhead under the current implementation.

If the memory overhead is a remaining issue, sweeping some of the hyperparameters certainly seems prudent -- in particular tuning max_samples_leaf as you noted. Other hyperparameters such as min_samples_split and max_leaf_nodes might also be worth considering.

