
Comments (5)

brdeleeuw commented on June 3, 2024

We managed to reduce file size substantially by (for now) taking a 2-step approach:

  • Tune hyperparameters with max_leaf_nodes on default value (None)
  • Fix these best hyperparameters and then downsample max_samples_leaf

We tested values between 1 and 100; even 1 showed quite similar performance. max_samples_leaf=10 gave us the best trade-off between performance and file size: the model went from 32 GB to around 250 MB while performance stayed roughly the same.

We'll consider fixing max_samples_leaf=10 for the next hyperparameter tuning iteration so we don't need the 2-step approach, though we suspect it will change the underlying tree structure, increase file size again, and therefore require additional tuning (or fixing) of max_leaf_nodes.

For now this should be sufficient, thanks again for the help!

from quantile-forest.

reidjohnson commented on June 3, 2024

Thanks for using the package!

The model sizes can get quite large, but there are a few levers to pull.

Given your dataset size and number of estimators, it's likely that the internal model data structure is very sparse. During model fitting, you can set the keyword argument sparse_pickle to True -- that is, model.fit(X_train, y_train, sparse_pickle=True). This converts the model data structure to a sparse matrix during serialization (i.e., with joblib or pickle); it's a lossless compression and requires no other code changes.

There are also some hyperparameters that can be tuned to reduce the model size with minimal impact on model accuracy. In addition to the standard RandomForestRegressor hyperparameters such as max_leaf_nodes, there's also max_samples_leaf, which, if set, randomly downsamples each leaf node to a maximum number of retained values. In practice, this sort of sampling can work well with QRF models that have a large number of leaf nodes.

Between these approaches, it should usually be possible to reduce the artifact size by several orders of magnitude with little or no impact on the accuracy of the model.


brdeleeuw commented on June 3, 2024

Thanks for getting back to us so quickly!

The max_samples_leaf parameter makes a lot of sense now. We initially decided not to touch it, figuring that retaining more samples would likely lead to better quantile estimation, but we overlooked that this parameter can make the model behave more like the R implementation of QRF, which randomly keeps one sample per leaf node. Great to have this flexibility built in!

And as for the sparse_pickle parameter, we completely missed that one. We'll try both approaches and report back any improvements or additional questions, thanks again!


brdeleeuw commented on June 3, 2024

Short update: the sparse_pickle parameter helped tremendously in reducing file size on disk (35 GB -> 700 MB), but the memory consumption when loading the model still seems to be the same. Is this expected, similar to what we'd see with joblib or pickle compression?

We'll now look for a lower max_samples_leaf value that shows minimal performance loss while still being low enough to substantially reduce file size.


reidjohnson commented on June 3, 2024

Thanks for sharing the update.

The memory consumption is in line with expectations. The sparse matrix is only used during model serialization/deserialization; converting it back to a dense matrix on load would (sadly) be expected to incur the same memory overhead under the current implementation.

If the memory overhead is a remaining issue, sweeping some of the hyperparameters certainly seems prudent -- in particular tuning max_samples_leaf as you noted. Other hyperparameters such as min_samples_split and max_leaf_nodes might also be worth considering.

