
A library to train, evaluate, interpret, and productionize decision forest models such as Random Forest and Gradient Boosted Decision Trees.

License: Apache License 2.0

cpp cli tensorflow machine-learning ml decision-trees decision-forest random-forest gradient-boosting cart interpretability distributed-computing go javascript pypi python

yggdrasil-decision-forests's Introduction

Yggdrasil Decision Forests (YDF) is a production-grade collection of algorithms developed in Google Switzerland 🏔️ since 2018 for the training, serving, and interpretation of decision forest models. YDF is available in Python, C++, CLI, in TensorFlow under the name TensorFlow Decision Forests, JavaScript (inference only), and Go (inference only).

To learn more about YDF, see the documentation.

For more information on the design of YDF, see our paper at KDD 2023: Yggdrasil Decision Forests: A Fast and Extensible Decision Forests Library.

Key features

  • A simple API for training, evaluation and serving of decision forests models.
  • Supports Random Forest, Gradient Boosted Trees and CART, and advanced learning algorithms such as oblique splits, honest trees, hessian and non-hessian scores, and global tree optimizations.
  • Train classification, regression, ranking, and uplifting models.
  • Fast model inference on CPU (microseconds / example / cpu-core).
  • Supports distributed training over billions of examples.
  • Serving in Python, C++, TensorFlow Serving, Go, JavaScript, and CLI.
  • Rich report for model description (e.g., training logs, plot trees), analysis (e.g., variable importances, partial dependence plots, conditional dependence plots), evaluation (e.g., accuracy, AUC, ROC plots, RMSE, confidence intervals), tuning (trials configuration and scores), and cross-validation.
  • Natively consumes numerical, categorical, boolean, text, and missing values.
  • Backward compatibility for model and learners since 2018.
  • Consumes Pandas Dataframes, Numpy arrays, TensorFlow Dataset and CSV files.

Installation

To install YDF in Python from PyPI, run:

pip install ydf

Usage example

Example with the Python API.

import ydf
import pandas as pd

train_ds = pd.read_csv("adult_train.csv")
test_ds = pd.read_csv("adult_test.csv")

# Train a model
model = ydf.GradientBoostedTreesLearner(label="income").train(train_ds)

# Look at a model (input features, training logs, structure, etc.)
model.describe()

# Evaluate a model (e.g. roc, accuracy, confusion matrix, confidence intervals)
model.evaluate(test_ds)

# Generate predictions
model.predict(test_ds)

# Analyse a model (e.g. partial dependence plot, variable importance)
model.analyze(test_ds)

# Benchmark the inference speed of a model
model.benchmark(test_ds)

# Save the model
model.save("/tmp/my_model")

Example with the C++ API.

auto dataset_path = "csv:train.csv";

// List columns in training dataset
DataSpecification spec;
CreateDataSpec(dataset_path, false, {}, &spec);

// Create a training configuration
TrainingConfig train_config;
train_config.set_learner("RANDOM_FOREST");
train_config.set_task(Task::CLASSIFICATION);
train_config.set_label("my_label");

// Train model
std::unique_ptr<AbstractLearner> learner;
GetLearner(train_config, &learner);
auto model = learner->Train(dataset_path, spec);

// Export model
SaveModel("my_model", model.get());

(based on examples/beginner.cc)

The same model can be trained in Python using TensorFlow Decision Forests as follows:

import tensorflow_decision_forests as tfdf
import pandas as pd

# Load dataset in a Pandas dataframe.
train_df = pd.read_csv("project/train.csv")

# Convert dataset into a TensorFlow dataset.
train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(train_df, label="my_label")

# Train model
model = tfdf.keras.RandomForestModel()
model.fit(train_ds)

# Export model.
model.save("project/model")

Next steps

Check the Getting Started tutorial 🧭.

Google I/O Presentation

Yggdrasil Decision Forests powers TensorFlow Decision Forests.

Citation

If you use Yggdrasil Decision Forests in a scientific publication, please cite the following paper: Yggdrasil Decision Forests: A Fast and Extensible Decision Forests Library.

Bibtex

@inproceedings{GBBSP23,
  author       = {Mathieu Guillame{-}Bert and
                  Sebastian Bruch and
                  Richard Stotz and
                  Jan Pfeifer},
  title        = {Yggdrasil Decision Forests: {A} Fast and Extensible Decision Forests
                  Library},
  booktitle    = {Proceedings of the 29th {ACM} {SIGKDD} Conference on Knowledge Discovery
                  and Data Mining, {KDD} 2023, Long Beach, CA, USA, August 6-10, 2023},
  pages        = {4068--4077},
  year         = {2023},
  url          = {https://doi.org/10.1145/3580305.3599933},
  doi          = {10.1145/3580305.3599933},
}

Raw

Yggdrasil Decision Forests: A Fast and Extensible Decision Forests Library, Guillame-Bert et al., KDD 2023: 4068-4077. doi:10.1145/3580305.3599933

Contact

You can contact the core development team at [email protected].

Credits

Yggdrasil Decision Forests and TensorFlow Decision Forests are developed by:

  • Mathieu Guillame-Bert (gbm AT google DOT com)
  • Jan Pfeifer (janpf AT google DOT com)
  • Sebastian Bruch (sebastian AT bruch DOT io)
  • Richard Stotz (richardstotz AT google DOT com)
  • Arvind Srinivasan (arvnd AT google DOT com)

Contributing

Contributions to TensorFlow Decision Forests and Yggdrasil Decision Forests are welcome. If you want to contribute, check the contribution guidelines.

License

Apache License 2.0


yggdrasil-decision-forests's Issues

Cannot use 'discretize_numerical_columns' in tuner

I am trying to add 'discretize_numerical_columns' to a tuner.choice:

tuner.choice('discretize_numerical_columns',[False, True])

But I get the following error:

ValueError: INVALID_ARGUMENT: Unknown param "discretize_numerical_columns".

This comes from:

File ~/miniforge3/envs/StAnd/lib/python3.11/site-packages/ydf/learner/generic_learner.py:238, in GenericLearner._train_from_dataset(self, ds, valid)
232 log.info(
233 "Train model on %d examples",
234 train_ds.nrow(),
235 )
237 time_begin_training_model = datetime.datetime.now()
--> 238 cc_model = self._get_learner().Train(**train_args)
239 log.info(
240 "Model trained in %s",
241 datetime.datetime.now() - time_begin_training_model,
242 )
244 return model_lib.load_cc_model(cc_model)

If I pass it to the RandomForestLearner directly, it works fine.
Can I use this parameter in the tuner?
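
Until the tuner question is answered, one possible workaround is to search over the parameter by hand. The sketch below is only an illustration under assumptions: manual_search and ydf_train_and_score are hypothetical helper names, the label "income" is borrowed from the README example, and it assumes (as the report states) that the learner constructor accepts discretize_numerical_columns and that the evaluation object exposes an accuracy attribute.

```python
from typing import Callable, Iterable, Tuple


def manual_search(
    values: Iterable[bool],
    train_and_score: Callable[[bool], float],
) -> Tuple[bool, float]:
    """Return the candidate value with the highest score, and that score."""
    best_value, best_score = None, float("-inf")
    for value in values:
        score = train_and_score(value)
        # Strict comparison: on ties the earliest candidate is kept.
        if score > best_score:
            best_value, best_score = value, score
    return best_value, best_score


def ydf_train_and_score(value, train_ds, valid_ds, label="income"):
    """Hypothetical scorer; assumes ydf is installed and the datasets exist."""
    import ydf  # imported lazily so manual_search has no dependencies

    learner = ydf.RandomForestLearner(
        label=label,
        discretize_numerical_columns=value,  # set on the learner, not the tuner
    )
    model = learner.train(train_ds)
    # Assumption: the evaluation object exposes an `accuracy` attribute.
    return model.evaluate(valid_ds).accuracy
```

With two datasets at hand, the call would look like manual_search([False, True], lambda v: ydf_train_and_score(v, train_ds, valid_ds)). Other hyperparameters can still be handed to the tuner inside the scorer.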

Documentation for using Go is out of date

The documentation says to use an import path starting with "google3/....", but this needs to be:

import (
	modelio "github.com/google/yggdrasil-decision-forests/yggdrasil_decision_forests/port/go/model/io/canonical"
	"github.com/google/yggdrasil-decision-forests/yggdrasil_decision_forests/port/go/serving"
)

Also, the README and the current docs use model_io as an import alias, but it is not idiomatic in Go to use _ in variable names or aliases.

On MacOSX, Mac M Hardware (ARM), a segmentation fault happened with YDF when pyarrow is installed

Setup : MacOSX 13 or 14, Mac M hardware

Prerequisite : Install miniforge3

% conda create --name ydfpandasissue
% conda activate ydfpandasissue
% conda install python=3.10
% conda install pandas
% pip install ydf-0.2.0-cp310-cp310-macosx_13_0_arm64.whl

Running this program (ydf_test.py) works.

import ydf
import pandas as pd
import numpy as np

dataset = {
    "x1": np.array([0, 0, 0, 1, 1, 1]),
    "x2": np.array([1, 1, 0, 0, 1, 1]),
    "y": np.array([0, 0, 0, 0, 1, 1]),
}

model = ydf.CartLearner(label="y", min_examples=1, task=ydf.Task.CLASSIFICATION).train(dataset)
print(model.describe())

Now install pyarrow, from either conda or pip. The result is the same: the program fails.
Only the error message differs.

% conda install pyarrow
% python ydf_test.py
zsh: segmentation fault  python ydf_test.py
% conda uninstall pyarrow
% pip install pyarrow
% python ydf_test.py
libc++abi: terminating due to uncaught exception of type std::__1::system_error: mutex lock failed: Invalid argument
zsh: abort      python ydf_test.py

Note that pyarrow is mandatory when working with large tabular datasets stored in Parquet files.
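
When reproducing a crash like this, it can help to record exactly which package versions are present in the environment before and after installing pyarrow. The following is a small standard-library sketch for that purpose; installed_version and environment_report are hypothetical helper names, and the package list is only a guess at what is relevant here.

```python
import platform
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def environment_report(packages=("ydf", "pyarrow", "pandas", "numpy")) -> dict:
    """Collect interpreter, architecture, and package versions in one dict."""
    report = {
        "python": platform.python_version(),
        "machine": platform.machine(),
    }
    for name in packages:
        report[name] = installed_version(name)
    return report


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

The report reads package metadata only and does not import ydf or pyarrow, so it should still run in an environment where importing them crashes.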

release-tag pre-compiled binaries for armv7

Could we get some pre-compiled binaries for armv7? I'm currently trying to compile on a Raspberry Pi 3B and am stuck on:

Compiling org_tensorflow/tensorflow/core/framework/node_def_util.cc; 12299s local
Compiling org_tensorflow/tensorflow/core/util/batch_util.cc; 12280s local
[Sched] Compiling org_tensorflow/tensorflow/core/util/tensor_slice_set.cc; 12299s
[Sched] Compiling org_tensorflow/tensorflow/core/util/matmul_autotune.cc; 12280s

Note that I used the config=use_tensorflow_io flag, which I think builds the TensorFlow I/O libraries, but it makes the compilation much longer. I had to do it this way because of an error I received with the default settings. I also had to take out the following:

# Instruction set optimizations
build:linux_avx2 --copt=-mavx2

because the build would error out, saying that -mavx2 is an unknown command.

Also, an unrelated question:
Is there any way we could get hypertuner support for TensorFlow Decision Forests?
https://www.tensorflow.org/tutorials/keras/keras_tuner#instantiate_the_tuner_and_perform_hypertuning

sklearn has GridSearchCV, and Keras neural networks have the hypertuner. It would be nice to have this kind of optimization for decision forests.

Best Regards,

Running quick Scorer Extended Model

Could you please let me know how to run the quick scorer extended model? There is a test file quick_scorer_extended_test.cc, but it creates a toy model on a toy dataset. I want something similar to the examples available in examples/beginner_cc, but that example does not show how to run the quick scorer algorithm.

I need to train a classification model with GradientBoostedTrees on a CSV dataset and convert the trained model to a GradientBoostedTreesBinaryClassificationQuickScorerExtended model to perform fast inference. How should I update examples/beginner_cc? Can anyone guide me on this?

Simple Model problem

Hi,

I have the following example. How can I rewrite the code so that I can use the model's Predict for the given input? There is no example in the docs.

#include "yggdrasil_decision_forests/dataset/data_spec.h"
#include "yggdrasil_decision_forests/dataset/data_spec.pb.h"
#include "yggdrasil_decision_forests/dataset/data_spec_inference.h"
#include "yggdrasil_decision_forests/dataset/vertical_dataset_io.h"
#include "yggdrasil_decision_forests/learner/learner_library.h"
#include "yggdrasil_decision_forests/metric/metric.h"
#include "yggdrasil_decision_forests/metric/report.h"
#include "yggdrasil_decision_forests/model/model_library.h"
#include "yggdrasil_decision_forests/utils/filesystem.h"
#include "yggdrasil_decision_forests/utils/logging.h"
#include "yggdrasil_decision_forests/serving/decision_forest/decision_forest.h"
#include <chrono>

namespace ygg = yggdrasil_decision_forests;

int main(int argc, char** argv) {
  // Enable the logging. Optional in most cases.
  InitLogging(argv[0], &argc, &argv, true);

  // Import the model.
  LOG(INFO) << "Import the model";
  const std::string model_path = "/tmp/my_saved_model/1/assets";
  std::unique_ptr<ygg::model::AbstractModel> model;
  QCHECK_OK(ygg::model::LoadModel(model_path, &model));

  // Show information about the model.
  // Like :show_model, but without the list of compatible engines.
  std::string model_description;
  model->AppendDescriptionAndStatistics(/*full_definition=*/false,
                                        &model_description);
  LOG(INFO) << "Model:\n" << model_description;

  auto start = std::chrono::high_resolution_clock::now();

  // Compile the model for fast inference.
  const std::unique_ptr<ygg::serving::FastEngine> serving_engine =
      model->BuildFastEngine().value();
  const auto& features = serving_engine->features();

  // Handle to two features.
  const auto age_feature = features.GetNumericalFeatureId("age").value();
  const auto sex_feature =
      features.GetCategoricalFeatureId("sex").value();

  // Allocate a batch of 1 examples.
  std::unique_ptr<ygg::serving::AbstractExampleSet> examples =
      serving_engine->AllocateExamples(1);

  // Set all the values as missing. This is only necessary if you don't set all
  // the feature values manually e.g. SetNumerical.
  //examples->FillMissing(features);

  // Set the value of "age" and "sex" for the first example.
  examples->SetNumerical(/*example_idx=*/0, age_feature, 50.f, features);
  examples->SetCategorical(/*example_idx=*/0, sex_feature, "Male",
                           features);

  // Run the predictions on the first example.
  std::vector<float> batch_of_predictions;
  serving_engine->Predict(*examples, 1, &batch_of_predictions);

  auto stop = std::chrono::high_resolution_clock::now();
  auto duration =
      std::chrono::duration_cast<std::chrono::milliseconds>(stop - start);

  // To get the value of duration use the count()
  // member function on the duration object
  LOG(INFO) << duration.count();

  LOG(INFO) << "Predictions:";
  for (const float prediction : batch_of_predictions) {
    LOG(INFO) << "\t" << prediction;
  }

  return 0;
}

Output:

[INFO beginner4.cc:31] Import the model
[INFO beginner4.cc:41] Model:
Type: "RANDOM_FOREST"
Task: CLASSIFICATION
Label: "__LABEL"

Input Features (2):
	age
	sex

No weights

Variable Importance: MEAN_MIN_DEPTH:
    1. "__LABEL"  8.480250 ################
    2.     "sex"  1.313142 ##
    3.     "age"  0.000000 

Variable Importance: NUM_AS_ROOT:
    1. "age" 300.000000 

Variable Importance: NUM_NODES:
    1. "age" 34255.000000 ################
    2. "sex" 1584.000000 

Variable Importance: SUM_SCORE:
    1. "age" 516361.208020 ################
    2. "sex" 148174.885377 



Winner take all: true
Out-of-bag evaluation: accuracy:0.756613 logloss:5.68258
Number of trees: 300
Total number of nodes: 71978

Number of nodes by tree:
Count: 300 Average: 239.927 StdDev: 4.94044
Min: 223 Max: 251 Ignored: 0
----------------------------------------------
[ 223, 224)  1   0.33%   0.33%
[ 224, 225)  0   0.00%   0.33%
[ 225, 227)  1   0.33%   0.67%
[ 227, 228)  3   1.00%   1.67% #
[ 228, 230)  3   1.00%   2.67% #
[ 230, 231)  0   0.00%   2.67%
[ 231, 233)  9   3.00%   5.67% ##
[ 233, 234) 17   5.67%  11.33% ###
[ 234, 236) 33  11.00%  22.33% ######
[ 236, 237)  0   0.00%  22.33%
[ 237, 238) 28   9.33%  31.67% #####
[ 238, 240) 51  17.00%  48.67% ##########
[ 240, 241)  0   0.00%  48.67%
[ 241, 243) 46  15.33%  64.00% #########
[ 243, 244) 47  15.67%  79.67% #########
[ 244, 246) 34  11.33%  91.00% #######
[ 246, 247)  0   0.00%  91.00%
[ 247, 249) 13   4.33%  95.33% ###
[ 249, 250) 10   3.33%  98.67% ##
[ 250, 251]  4   1.33% 100.00% #

Depth by leafs:
Count: 36139 Average: 8.48037 StdDev: 2.34049
Min: 3 Max: 15 Ignored: 0
----------------------------------------------
[  3,  4)   70   0.19%   0.19%
[  4,  5)  662   1.83%   2.03% #
[  5,  6) 3633  10.05%  12.08% ######
[  6,  7) 3980  11.01%  23.09% #######
[  7,  8) 4520  12.51%  35.60% ########
[  8,  9) 5143  14.23%  49.83% #########
[  9, 10) 5817  16.10%  65.93% ##########
[ 10, 11) 5152  14.26%  80.18% #########
[ 11, 12) 3519   9.74%  89.92% ######
[ 12, 13) 2003   5.54%  95.46% ###
[ 13, 14)  975   2.70%  98.16% ##
[ 14, 15)  483   1.34%  99.50% #
[ 15, 15]  182   0.50% 100.00%

Number of training obs by leaf:
Count: 36139 Average: 190.182 StdDev: 179.903
Min: 5 Max: 2370 Ignored: 0
----------------------------------------------
[    5,  123) 14849  41.09%  41.09% ##########
[  123,  241) 10680  29.55%  70.64% #######
[  241,  359)  3917  10.84%  81.48% ###
[  359,  478)  6234  17.25%  98.73% ####
[  478,  596)   137   0.38%  99.11%
[  596,  714)     7   0.02%  99.13%
[  714,  833)    17   0.05%  99.18%
[  833,  951)    19   0.05%  99.23%
[  951, 1069)     6   0.02%  99.24%
[ 1069, 1188)   135   0.37%  99.62%
[ 1188, 1306)    33   0.09%  99.71%
[ 1306, 1424)     0   0.00%  99.71%
[ 1424, 1542)     0   0.00%  99.71%
[ 1542, 1661)     8   0.02%  99.73%
[ 1661, 1779)    53   0.15%  99.88%
[ 1779, 1897)     6   0.02%  99.89%
[ 1897, 2016)     0   0.00%  99.89%
[ 2016, 2134)     2   0.01%  99.90%
[ 2134, 2252)    28   0.08%  99.98%
[ 2252, 2370]     8   0.02% 100.00%

Attribute in nodes:
	34255 : age [NUMERICAL]
	1584 : sex [CATEGORICAL]

Attribute in nodes with depth <= 0:
	300 : age [NUMERICAL]

Attribute in nodes with depth <= 1:
	600 : age [NUMERICAL]
	300 : sex [CATEGORICAL]

Attribute in nodes with depth <= 2:
	1721 : age [NUMERICAL]
	379 : sex [CATEGORICAL]

Attribute in nodes with depth <= 3:
	3328 : age [NUMERICAL]
	1102 : sex [CATEGORICAL]

Attribute in nodes with depth <= 5:
	11208 : age [NUMERICAL]
	1583 : sex [CATEGORICAL]

Condition type in nodes:
	34255 : HigherCondition
	1584 : ContainsBitmapCondition
Condition type in nodes with depth <= 0:
	300 : HigherCondition
Condition type in nodes with depth <= 1:
	600 : HigherCondition
	300 : ContainsBitmapCondition
Condition type in nodes with depth <= 2:
	1721 : HigherCondition
	379 : ContainsBitmapCondition
Condition type in nodes with depth <= 3:
	3328 : HigherCondition
	1102 : ContainsBitmapCondition
Condition type in nodes with depth <= 5:
	11208 : HigherCondition
	1583 : ContainsBitmapCondition
Node format: BLOB_SEQUENCE

Training OOB:
	trees: 1, Out-of-bag evaluation: accuracy:0.750237 logloss:9.00239
	trees: 13, Out-of-bag evaluation: accuracy:0.754722 logloss:7.09704
	trees: 23, Out-of-bag evaluation: accuracy:0.753863 logloss:6.3117
	trees: 33, Out-of-bag evaluation: accuracy:0.75395 logloss:6.19856
	trees: 43, Out-of-bag evaluation: accuracy:0.754299 logloss:6.0429
	trees: 53, Out-of-bag evaluation: accuracy:0.753165 logloss:5.9747
	trees: 63, Out-of-bag evaluation: accuracy:0.754867 logloss:5.96594
	trees: 73, Out-of-bag evaluation: accuracy:0.75443 logloss:5.92934
	trees: 83, Out-of-bag evaluation: accuracy:0.754954 logloss:5.92356
	trees: 93, Out-of-bag evaluation: accuracy:0.756307 logloss:5.89293
	trees: 103, Out-of-bag evaluation: accuracy:0.756569 logloss:5.89233
	trees: 113, Out-of-bag evaluation: accuracy:0.755696 logloss:5.81216
	trees: 123, Out-of-bag evaluation: accuracy:0.755653 logloss:5.81
	trees: 133, Out-of-bag evaluation: accuracy:0.755347 logloss:5.80588
	trees: 143, Out-of-bag evaluation: accuracy:0.755914 logloss:5.77847
	trees: 153, Out-of-bag evaluation: accuracy:0.755783 logloss:5.77828
	trees: 163, Out-of-bag evaluation: accuracy:0.755522 logloss:5.74301
	trees: 173, Out-of-bag evaluation: accuracy:0.756264 logloss:5.73794
	trees: 183, Out-of-bag evaluation: accuracy:0.756176 logloss:5.7384
	trees: 193, Out-of-bag evaluation: accuracy:0.756831 logloss:5.73852
	trees: 203, Out-of-bag evaluation: accuracy:0.756613 logloss:5.73565
	trees: 213, Out-of-bag evaluation: accuracy:0.757268 logloss:5.73545
	trees: 223, Out-of-bag evaluation: accuracy:0.757486 logloss:5.7286
	trees: 233, Out-of-bag evaluation: accuracy:0.757093 logloss:5.72556
	trees: 243, Out-of-bag evaluation: accuracy:0.757093 logloss:5.7087
	trees: 253, Out-of-bag evaluation: accuracy:0.757224 logloss:5.70704
	trees: 263, Out-of-bag evaluation: accuracy:0.757006 logloss:5.69758
	trees: 273, Out-of-bag evaluation: accuracy:0.756831 logloss:5.69816
	trees: 283, Out-of-bag evaluation: accuracy:0.756526 logloss:5.69813
	trees: 293, Out-of-bag evaluation: accuracy:0.756526 logloss:5.68217
	trees: 300, Out-of-bag evaluation: accuracy:0.756613 logloss:5.68258

[INFO decision_forest.cc:639] Model loaded with 300 root(s), 71978 node(s), and 2 input feature(s).
[INFO abstract_model.cc:1158] Engine "RandomForestOptPred" built
[INFO beginner4.cc:79] 18
[INFO beginner4.cc:81] Predictions:
[INFO beginner4.cc:83] 	0.649999

Let's say I want to continuously read the inference input data from stdin. Do I only need to load the model and initialize serving_engine once?

Every time stdin has new data to run a prediction on, do I just need to run the following code? Does anything change if I switch to HTTP inference? Is there anything to consider regarding thread safety, i.e., can the model and serving_engine be shared across multiple threads?

Is there anything I should change if I don't need batching? Do I still need to use std::unique_ptr<ygg::serving::AbstractExampleSet> examples = serving_engine->AllocateExamples(1)?

  // Handle to two features.
  const auto age_feature = features.GetNumericalFeatureId("age").value();
  const auto sex_feature =
      features.GetCategoricalFeatureId("sex").value();

  // Allocate a batch of 1 examples.
  std::unique_ptr<ygg::serving::AbstractExampleSet> examples =
      serving_engine->AllocateExamples(1);

  // Set all the values as missing. This is only necessary if you don't set all
  // the feature values manually e.g. SetNumerical.
  //examples->FillMissing(features);

  // Set the value of "age" and "sex" for the first example.
  examples->SetNumerical(/*example_idx=*/0, age_feature, 50.f, features);
  examples->SetCategorical(/*example_idx=*/0, sex_feature, "Male",
                           features);

  // Run the predictions on the first example.
  std::vector<float> batch_of_predictions;
  serving_engine->Predict(*examples, 1, &batch_of_predictions);

  auto stop = std::chrono::high_resolution_clock::now();
  auto duration =
      std::chrono::duration_cast<std::chrono::milliseconds>(stop - start);

  // To get the value of duration use the count()
  // member function on the duration object
  LOG(INFO) << duration.count();

  LOG(INFO) << "Predictions:";
  for (const float prediction : batch_of_predictions) {
    LOG(INFO) << "\t" << prediction;
  }
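
The "load once, predict per input" structure asked about above can be sketched independently of the library. The Python outline below is an illustration only: serve_lines is a hypothetical helper, the "age,sex" line format is an assumption, and the predict callable stands in for whatever wraps the already-loaded model. It does not settle the thread-safety question, which is for the maintainers to answer.

```python
import io
from typing import Callable, Iterable, Iterator, Tuple


def serve_lines(
    lines: Iterable[str],
    predict: Callable[[float, str], float],
) -> Iterator[Tuple[str, float]]:
    """Yield (line, prediction) for each non-empty 'age,sex' input line.

    `predict` is created once by the caller (the expensive model loading
    happens there) and is reused for every line.
    """
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines instead of failing on them
        age_text, sex = line.split(",", 1)
        yield line, predict(float(age_text), sex.strip())


if __name__ == "__main__":
    # Stand-in predictor for the demo; a real one would call the loaded model.
    def fake_predict(age: float, sex: str) -> float:
        return age / 100.0

    demo_input = io.StringIO("50,Male\n30,Female\n")
    for line, prediction in serve_lines(demo_input, fake_predict):
        print(line, prediction)
```

Passing sys.stdin instead of the StringIO object turns the demo into the stdin loop described in the question; only the per-example work stays inside the loop.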

I also tried to use the C API here, but the result tensor is always 1.0. Any idea why?

// gcc -I/usr/local/include -L/usr/local/lib main.c -ltensorflow -o main
#include <stdio.h>
#include <stdlib.h>
#include <tensorflow/c/c_api.h>

void NoOpDeallocator(void* data, size_t a, void* b) {}

int main() {
  TF_Graph *Graph = TF_NewGraph();
  TF_Status *Status = TF_NewStatus();
  TF_SessionOptions *SessionOpts = TF_NewSessionOptions();
  TF_Buffer *RunOpts = NULL;
  TF_Library *library;

  library = TF_LoadLibrary("/usr/local/lib/python3.10/dist-packages/tensorflow_decision_forests/tensorflow/ops/inference/inference.so",
                              Status);

  const char *saved_model_dir = "/tmp/my_saved_model/1/";
  const char *tags = "serve";
  int ntags = 1;

  TF_Session *Session = TF_LoadSessionFromSavedModel(
      SessionOpts, RunOpts, saved_model_dir, &tags, ntags, Graph, NULL, Status);

  printf("status: %s\n", TF_Message(Status));

  if(TF_GetCode(Status) == TF_OK) {
    printf("loaded\n");
  }else{
    printf("not loaded\n");
  }

  /* Get Input Tensor */
  int NumInputs = 2;

  TF_Output* Input = malloc(sizeof(TF_Output) * NumInputs);
  TF_Output t0 = {TF_GraphOperationByName(Graph, "serving_default_age"), 0};

  if(t0.oper == NULL)
    printf("ERROR: Failed TF_GraphOperationByName serving_default_input_1\n");
  else
    printf("TF_GraphOperationByName serving_default_input_1 is OK\n");

  TF_Output t1 = {TF_GraphOperationByName(Graph, "serving_default_sex"), 0};

  if(t1.oper == NULL)
    printf("ERROR: Failed TF_GraphOperationByName serving_default_input_2\n");
  else
    printf("TF_GraphOperationByName serving_default_input_2 is OK\n");

  Input[0] = t0;
  Input[1] = t1;

  // Get Output tensor
  int NumOutputs = 1;
  TF_Output* Output = malloc(sizeof(TF_Output) * NumOutputs);
  TF_Output tout = {TF_GraphOperationByName(Graph, "StatefulPartitionedCall_1"), 0};

  if(tout.oper == NULL)
      printf("ERROR: Failed TF_GraphOperationByName StatefulPartitionedCall\n");
  else
    printf("TF_GraphOperationByName StatefulPartitionedCall is OK\n");

  Output[0] = tout;

  /* Allocate data for inputs and outputs */
  TF_Tensor** InputValues  = (TF_Tensor**)malloc(sizeof(TF_Tensor*)*NumInputs);
  TF_Tensor** OutputValues = (TF_Tensor**)malloc(sizeof(TF_Tensor*)*NumOutputs);

  int ndims = 1;
  int64_t dims[] = {1};
  int64_t data[] = {50};

  int ndata = sizeof(int64_t);
  TF_Tensor* int_tensor0 = TF_NewTensor(TF_INT64, dims, ndims, data, ndata, &NoOpDeallocator, 0);

  if (int_tensor0 != NULL)
    printf("TF_NewTensor is OK\n");
  else
    printf("ERROR: Failed TF_NewTensor\n");

  const char test_string[] = "Male";
  TF_TString tstr[1];
  TF_TString_Init(&tstr[0]);
  TF_TString_Copy(&tstr[0], test_string, sizeof(test_string)-1);
  TF_Tensor* int_tensor1 = TF_NewTensor(TF_STRING, NULL, 0, &tstr[0], sizeof(tstr), &NoOpDeallocator, 0);

  if (int_tensor1 != NULL)
    printf("TF_NewTensor is OK\n");
  else
    printf("ERROR: Failed TF_NewTensor\n");

  InputValues[0] = int_tensor0;
  InputValues[1] = int_tensor1;

  // Run the Session
  TF_SessionRun(Session,
                NULL, // Run options.
                Input, InputValues, NumInputs, // Input tensors name, input tensor values, number of inputs.
                Output, OutputValues, NumOutputs, // Output tensors name, output tensor values, number of outputs.
                NULL, 0, // Target operations, number of targets.
                NULL,
                Status); // Output status.

  if(TF_GetCode(Status) == TF_OK)
    printf("Session is OK\n");
  else
    printf("%s",TF_Message(Status));

  // Free memory
  TF_DeleteGraph(Graph);
  TF_DeleteSession(Session, Status);
  TF_DeleteSessionOptions(SessionOpts);
  TF_DeleteStatus(Status);

  /* Get Output Result */
  void* buff = TF_TensorData(OutputValues[0]);
  float* offsets = (float*)buff;
  printf("Result Tensor :\n");
  printf("%f\n",offsets[0]);
  return 0;
}

Output:

# gcc -I/usr/local/include -L/usr/local/lib main.c -ltensorflow -o main; ./main
2022-06-16 21:05:24.070748: I tensorflow/cc/saved_model/reader.cc:43] Reading SavedModel from: /tmp/my_saved_model/1/
2022-06-16 21:05:24.072148: I tensorflow/cc/saved_model/reader.cc:81] Reading meta graph with tags { serve }
2022-06-16 21:05:24.072208: I tensorflow/cc/saved_model/reader.cc:122] Reading SavedModel debug info (if present) from: /tmp/my_saved_model/1/
2022-06-16 21:05:24.072280: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-06-16 21:05:24.085806: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:354] MLIR V1 optimization pass is not enabled
2022-06-16 21:05:24.086985: I tensorflow/cc/saved_model/loader.cc:228] Restoring SavedModel bundle.
2022-06-16 21:05:24.116570: I tensorflow/cc/saved_model/loader.cc:212] Running initialization op on SavedModel bundle at path: /tmp/my_saved_model/1/
[INFO kernel.cc:1176] Loading model from path /tmp/my_saved_model/1/assets/ with prefix cf8326335a66430a
[INFO decision_forest.cc:639] Model loaded with 300 root(s), 71978 node(s), and 2 input feature(s).
[INFO abstract_model.cc:1246] Engine "RandomForestOptPred" built
[INFO kernel.cc:1022] Use fast generic engine
2022-06-16 21:05:24.373058: I tensorflow/cc/saved_model/loader.cc:301] SavedModel load for tags { serve }; Status: success: OK. Took 302321 microseconds.
status: 
loaded
TF_GraphOperationByName serving_default_input_1 is OK
TF_GraphOperationByName serving_default_input_2 is OK
TF_GraphOperationByName StatefulPartitionedCall is OK
TF_NewTensor is OK
TF_NewTensor is OK
Session is OK
Result Tensor :
1.000000

Any way to benefit from retraining?

Most users who try to train a classifier have to carry out several attempts at training until they get acceptable results. This means that in each consecutive attempt they have to resend and retrain the classifier using almost the same training data, with only a few added samples. This seems wasteful. Is there any way to use Yggdrasil to benefit from this knowledge?

building off external hard drive

root@mocha-eft:/mnt/sdcard/yggdrasil-decision-forests# bazel --output_user_root=/mnt/sdcard/install build //examples:beginner_cc --config=linux_cpp17 --config=linux_avx2
Extracting Bazel installation...
FATAL: failed to create installation symlink '/mnt/sdcard/install/32ee77bc3907dda3edc97c30cd47096e/install': (error: 1): Operation not permitted
root@mocha-eft:/mnt/sdcard/yggdrasil-decision-forests#

Does anyone know whether Bazel can be used on an external hard disk? For example, say I mounted an SD card on my Raspberry Pi. Can I use Bazel to build from the mounted drive? I know Bazel uses Java, which I assume can cause some odd permission issues.
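
Since the error is about creating a symlink, one thing worth ruling out is whether the mounted filesystem supports symlinks at all (FAT and exFAT, which are common on SD cards, do not). This is only a possible cause, not a confirmed diagnosis. The sketch below probes a directory with the Python standard library; supports_symlinks is a hypothetical helper name.

```python
import os
import tempfile


def supports_symlinks(directory: str) -> bool:
    """Return True if a symlink can be created inside `directory`."""
    # Work in a scratch subdirectory that is removed afterwards.
    with tempfile.TemporaryDirectory(dir=directory) as scratch:
        target = os.path.join(scratch, "target")
        link = os.path.join(scratch, "link")
        with open(target, "w") as handle:
            handle.write("probe")
        try:
            os.symlink(target, link)
        except (OSError, NotImplementedError):
            # Typical for filesystems or platforms without symlink support.
            return False
        return os.path.islink(link)


if __name__ == "__main__":
    print(supports_symlinks(tempfile.gettempdir()))
```

Calling supports_symlinks("/mnt/sdcard") on the machine from the report would show whether the output root can hold the symlink Bazel tries to create.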

Yggdrasil on the GPU

Hi and Happy New Year!

In one of our chats, it was mentioned that:

"Inference on GPU of decision forests is very fast (but it requires a GPU). In some old experiments I did, I observed a ~30x speed-up when comparing the CPU version (single-threaded) with a basic GPU implementation"

Given such an improvement on speed, I'd be interested in exploring a GPU implementation. Currently I use a GBT model with the quickscorer algorithm. How should I go about changing this into a GPU implementation? Here I'm just looking for general guidelines since I haven't seen any obvious ways of doing this in the documentation. Are there any flags that I can set (either when compiling the lib or during implementation) to facilitate GPU compilation? Does the quickscorer algorithm allow for GPU implementation? I've read that specific engines are tied up to specific hardware (e.g. GPU), but I'm not sure if this also applies to a GradientBoostedTreesLearner?

Thanks in advance for your help.

Follow Up Question

Follow-up to the last issue.

Hi again. We looked further into this matter by comparing the implemented algorithms, and we think we found a difference. At this point we are unsure whether the difference is significant, but we wanted to share it in case the information is useful.

While comparing, we looked at the trees the algorithms produced and found that the trees in Yggdrasil grew much larger than in XGBoost. To find out why, we compared how the two algorithms grow their trees and how they decide whether a split happens.
We think we found the reason why the trees grow larger.

To get a better view on the internals we used this small dataset:

Index  Feature_a  Feature_b  Label
0      0          1          0
1      0          0          0
2      1          1          1
3      0          1          0

Splits

Xgboost split

Split 1

Best split candidate:

  • Feature: Feature_a
  • Root Gain: 1
  • Index Left: 2
  • Index Right: 0, 1, 3
  • Score Left: 1
  • Score Right: 3
  • Split happening: Yes

Split 2

| Index | Feature_a | Feature_b | Label |
|-------|-----------|-----------|-------|
| 0     | 0         | 1         | 0     |
| 1     | 0         | 0         | 0     |
| 2     | 0         | 1         | 0     |

Best split candidate:

  • Feature: Feature_b
  • Root Gain: 3
  • Index Left: 0, 2
  • Index Right: 1
  • Score Left: 2
  • Score Right: 1
  • Split happening: No

Yggdrasil split

Split 1

Best split candidate:

  • Feature: Feature_a
  • Root Gain: 0
  • Index Left: 2
  • Index Right: 0, 1, 3
  • Score Left: 1
  • Score Right: 3
  • Split happening: Yes

Split 2

| Index | Feature_a | Feature_b | Label |
|-------|-----------|-----------|-------|
| 0     | 0         | 1         | 0     |
| 1     | 0         | 0         | 0     |
| 2     | 0         | 1         | 0     |

Best split candidate:

  • Feature: Feature_b
  • Root Gain: 0
  • Index Left: 0, 2
  • Index Right: 1
  • Score Left: 0.666667
  • Score Right: 0.33333
  • Split happening: Yes

Explanation

Unlike XGBoost, Yggdrasil does not seem to carry the gain of the previous split forward. The gain appears to be reset to 0 before each new split is evaluated.
We think this is why we found Yggdrasil trees to grow larger than XGBoost trees. Yggdrasil resets the score to 0 on every iteration, while XGBoost looks for splits that score better than the parent node.
The NodeCondition is initialized every time the split function is called, and a new NodeCondition defaults to a score of 0.
In this case this is not an issue, but it may or may not be one for more complex problems.
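To make the "compare against the parent" idea concrete, here is a minimal sketch of the second-order split gain as it is usually written for XGBoost-style boosting. It is not code from either library: the function names are made up for illustration, the L1 term (reg_alpha) is ignored, and the gradients assume a logistic loss with an initial probability of 0.5. It is not meant to reproduce the scores listed above, only to show where the parent score enters.

```python
def structure_score(grad_sum, hess_sum, reg_lambda):
    """Score G^2 / (H + lambda) of a single node."""
    return grad_sum ** 2 / (hess_sum + reg_lambda)


def split_gain(grads, hessians, left_idx, reg_lambda=1.0, gamma=0.0):
    """Gain of a split: children scores minus the parent score."""
    left = set(left_idx)
    g_total, h_total = sum(grads), sum(hessians)
    g_left = sum(g for i, g in enumerate(grads) if i in left)
    h_left = sum(h for i, h in enumerate(hessians) if i in left)
    g_right, h_right = g_total - g_left, h_total - h_left

    children = (structure_score(g_left, h_left, reg_lambda)
                + structure_score(g_right, h_right, reg_lambda))
    parent = structure_score(g_total, h_total, reg_lambda)
    # A positive value means the split beats keeping the parent as a leaf.
    return 0.5 * (children - parent) - gamma


# Toy dataset from the post: labels of examples 0..3.
labels = [0, 0, 1, 0]
p = 0.5  # assumed initial predicted probability
grads = [p - y for y in labels]          # logistic-loss gradients
hessians = [p * (1.0 - p)] * len(labels)  # logistic-loss hessians

# Split on Feature_a: example 2 goes left, the rest go right.
gain_feature_a = split_gain(grads, hessians, left_idx=[2])
print(round(gain_feature_a, 4))
```

If the parent term were dropped (or treated as 0), every split with non-zero child scores would look like an improvement, which matches the behavior described in the explanation above.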

Comparison

We looked for differences in performance across several datasets. So far we have only used small toy datasets.

Setup

XGBoost was running on a macOS laptop:

  • 2.6 GHz 6-core Intel Core i7
  • 16 GB 2667 MHz DDR4
  • AMD Radeon Pro 5300M 4 GB, Intel UHD Graphics 630 1536 MB

TFDF was running on Google Colab and Multipass.

Multipass:

  • Ubuntu 20.04.3 LTS
  • 14.4G Disk
  • 3.8G Memory

Datasets

Sklearn - Breast Cancer Dataset

  • Classes: 2
  • Samples per class: 212 (M), 357 (B)
  • Samples total: 569
  • Dimensionality: 30
  • Features: real, positive

Sklearn - Digit Dataset

  • Classes: 10
  • Samples per class: 180
  • Samples total: 1797
  • Dimensionality: 64
  • Features: integers 0-16

Both datasets were split with sklearn's train_test_split, using test_size=0.2 and random_state=42.

Comparison

The datasets were loaded with sklearn and compared with different configurations of both models. We first tried to make the two configurations as similar as possible.

Parameters

Yggdrasil

dt_kwargs_base = {
    'num_trees':1000,
    'growing_strategy':"BEST_FIRST_GLOBAL",
    'max_depth':6,
    'use_hessian_gain':True,
    'sorting_strategy':"IN_NODE",
    'shrinkage':1.,
    'subsample':1.,
    'sampling_method': 'RANDOM',
    'l1_regularization':1.,
    'l2_regularization':1.,
    'l2_categorical_regularization':1.,
    'num_candidate_attributes': -1,
    'num_candidate_attributes_ratio': -1.,
    'min_examples':1,
    'validation_ratio':0.,
    'early_stopping':"NONE",
    'in_split_min_examples_check':False,
    'max_num_nodes': -1,
    'verbose': 0,
}

XGBoost

gb_kwargs = {
    'n_estimators': 1000,
    'max_depth': 6,
    'colsample_bytree': 1.,
    'colsample_bynode': 1.,
    'colsample_bylevel': 1.,
    'use_label_encoder': False,
    'reg_lambda': 1.,
    'reg_alpha': 1.,
    'min_child_weight': 0,
    'min_split_loss': 0,
    'max_delta_step': 0,
    'base_score': 0.5,
    'learning_rate': 1.,
    'tree_method': 'exact',
    'booster': 'gbtree',
    'nthread': 1,
    'eval_metric': eval_metric,
    'objective': objective,
    'subsample': 1,
    'verbosity': 0,
    'validate_parameters': False,
    'scale_pos_weight': 1,
    'refresh_leaf': 1,
    #'early_stopping_rounds': 10,
}

| Model / Dataset | Breast Cancer | Digits |
|-----------------|---------------|--------|
| XGBoost         | 0.1072        | 0.1606 |
| Yggdrasil       | 0.0944        | 0.2350 |
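When trying to make two libraries "as equal as possible", it can help to check the paired hyperparameters mechanically. The sketch below does that for the values listed above. The name correspondence in YDF_TO_XGB is an assumption inferred from the two dictionaries, not an official mapping, and matching numbers do not guarantee matching semantics (for example, depth may be counted differently, and the growing strategies differ).

```python
# Hypothetical name correspondence, inferred from the two configs in the post.
YDF_TO_XGB = {
    "num_trees": "n_estimators",
    "max_depth": "max_depth",
    "shrinkage": "learning_rate",
    "subsample": "subsample",
    "l1_regularization": "reg_alpha",
    "l2_regularization": "reg_lambda",
}

# Values copied from the configurations above (mapped keys only).
ydf_params = {
    "num_trees": 1000, "max_depth": 6, "shrinkage": 1.0,
    "subsample": 1.0, "l1_regularization": 1.0, "l2_regularization": 1.0,
}
xgb_params = {
    "n_estimators": 1000, "max_depth": 6, "learning_rate": 1.0,
    "subsample": 1, "reg_alpha": 1.0, "reg_lambda": 1.0,
}


def find_mismatches(ydf, xgb, mapping):
    """Return {ydf_name: (ydf_value, xgb_value)} for pairs that differ."""
    return {
        ydf_name: (ydf[ydf_name], xgb[xgb_name])
        for ydf_name, xgb_name in mapping.items()
        if float(ydf[ydf_name]) != float(xgb[xgb_name])
    }


mismatches = find_mismatches(ydf_params, xgb_params, YDF_TO_XGB)
print(mismatches or "all mapped hyperparameters agree")
```

Parameters without an obvious counterpart (such as growing_strategy or tree_method) are deliberately left out of the mapping and still need to be compared by hand.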

Conclusion

So far we are not sure whether the different approaches of the two algorithms make a significant difference in model performance. After tuning the hyperparameters, Yggdrasil performed as well as or even better than XGBoost. I wanted to reach out to ask whether you can confirm or refute our observation and give further insight.

Thanks a lot and best regards,

Timo

Decision forests prediction question

Hi,

Is the generated Yggdrasil Decision Forests model in the same format as other TF models?
Could I use https://github.com/galeone/tfgo and call predict from a Go app?

I ran into an issue with Bazel when building the standalone example. Any idea what the issue could be?

root@efc8844082ba:/notebooks/yggdrasil-decision-forests# uname -a
Linux efc8844082ba 5.10.103-0-virt #1-Alpine SMP Tue, 08 Mar 2022 10:06:11 +0000 x86_64 x86_64 x86_64 GNU/Linux
root@efc8844082ba:/notebooks/yggdrasil-decision-forests# 
root@efc8844082ba:/notebooks/yggdrasil-decision-forests# 
root@efc8844082ba:/notebooks/yggdrasil-decision-forests# cat /etc/os-release
NAME="Ubuntu"
VERSION="20.04.4 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.4 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal

root@3ac4b2a4ad3d:/notebooks/yggdrasil-decision-forests# uname -a
Linux efc8844082ba 5.10.103-0-virt #1-Alpine SMP Tue, 08 Mar 2022 10:06:11 +0000 x86_64 x86_64 x86_64 GNU/Linux

# root@3ac4b2a4ad3d:/notebooks/yggdrasil-decision-forests# bazel --version
bazel 5.1.1

root@3ac4b2a4ad3d:/notebooks/yggdrasil-decision-forests# bazel build //yggdrasil_decision_forests/cli:all --config=linux_cpp17 --config=linux_avx2
Extracting Bazel installation...
Starting local Bazel server and connecting to it...
INFO: Reading rc options for 'build' from /notebooks/yggdrasil-decision-forests/.bazelrc:
  Inherited 'common' options: --experimental_repo_remote_exec --incompatible_restrict_string_escapes=false
ERROR: --incompatible_restrict_string_escapes=false :: Unrecognized option: --incompatible_restrict_string_escapes=false

Here is how I installed Bazel: https://docs.bazel.build/versions/main/install-ubuntu.html#19

How I worked around the issue:
I disabled this line: https://github.com/google/yggdrasil-decision-forests/blob/main/.bazelrc#L43

But then I got this error:

root@efc8844082ba:/notebooks/yggdrasil-decision-forests# bazel build //yggdrasil_decision_forests/cli:all --config=linux_cpp17 --config=linux_avx2
Extracting Bazel installation...
Starting local Bazel server and connecting to it...
INFO: Options provided by the client:
  Inherited 'common' options: --isatty=1 --terminal_columns=96
INFO: Reading rc options for 'build' from /notebooks/yggdrasil-decision-forests/.bazelrc:
  Inherited 'common' options: --experimental_repo_remote_exec
INFO: Reading rc options for 'build' from /notebooks/yggdrasil-decision-forests/.bazelrc:
  'build' options: -c opt --spawn_strategy=standalone --announce_rc --noincompatible_strict_action_env --define=use_fast_cpp_protos=true --define=allow_oversize_protos=true --define=grpc_no_ares=true --color=yes
INFO: Found applicable config definition build:linux_cpp17 in file /notebooks/yggdrasil-decision-forests/.bazelrc: --cxxopt=-std=c++17 --host_cxxopt=-std=c++17 --config=linux
INFO: Found applicable config definition build:linux in file /notebooks/yggdrasil-decision-forests/.bazelrc: --copt=-fdiagnostics-color=always --copt=-w --host_copt=-w
INFO: Found applicable config definition build:linux_avx2 in file /notebooks/yggdrasil-decision-forests/.bazelrc: --copt=-mavx2
DEBUG: /root/.cache/bazel/_bazel_root/e69e42dd9f08c8f44fd8644c44ecd3fd/external/org_tensorflow/third_party/repo.bzl:108:14: 
Warning: skipping import of repository 'com_google_absl' because it already exists.
DEBUG: /root/.cache/bazel/_bazel_root/e69e42dd9f08c8f44fd8644c44ecd3fd/external/org_tensorflow/third_party/repo.bzl:108:14: 
Warning: skipping import of repository 'farmhash_archive' because it already exists.
DEBUG: /root/.cache/bazel/_bazel_root/e69e42dd9f08c8f44fd8644c44ecd3fd/external/org_tensorflow/third_party/repo.bzl:108:14: 
Warning: skipping import of repository 'com_google_protobuf' because it already exists.
DEBUG: /root/.cache/bazel/_bazel_root/e69e42dd9f08c8f44fd8644c44ecd3fd/external/org_tensorflow/third_party/repo.bzl:108:14: 
Warning: skipping import of repository 'com_google_googletest' because it already exists.
DEBUG: /root/.cache/bazel/_bazel_root/e69e42dd9f08c8f44fd8644c44ecd3fd/external/org_tensorflow/third_party/repo.bzl:108:14: 
Warning: skipping import of repository 'zlib' because it already exists.
DEBUG: /root/.cache/bazel/_bazel_root/e69e42dd9f08c8f44fd8644c44ecd3fd/external/org_tensorflow/third_party/repo.bzl:108:14: 
Warning: skipping import of repository 'rules_cc' because it already exists.
DEBUG: /root/.cache/bazel/_bazel_root/e69e42dd9f08c8f44fd8644c44ecd3fd/external/org_tensorflow/third_party/repo.bzl:108:14: 
Warning: skipping import of repository 'rules_python' because it already exists.
DEBUG: /root/.cache/bazel/_bazel_root/e69e42dd9f08c8f44fd8644c44ecd3fd/external/org_tensorflow/third_party/repo.bzl:108:14: 
Warning: skipping import of repository 'bazel_skylib' because it already exists.
INFO: Repository local_execution_config_python instantiated at:
  /notebooks/yggdrasil-decision-forests/WORKSPACE:38:4: in <toplevel>
  /root/.cache/bazel/_bazel_root/e69e42dd9f08c8f44fd8644c44ecd3fd/external/org_tensorflow/tensorflow/workspace2.bzl:1108:19: in workspace
  /root/.cache/bazel/_bazel_root/e69e42dd9f08c8f44fd8644c44ecd3fd/external/org_tensorflow/tensorflow/workspace2.bzl:84:27: in _tf_toolchains
  /root/.cache/bazel/_bazel_root/e69e42dd9f08c8f44fd8644c44ecd3fd/external/tf_toolchains/toolchains/remote_config/configs.bzl:6:28: in initialize_rbe_configs
  /root/.cache/bazel/_bazel_root/e69e42dd9f08c8f44fd8644c44ecd3fd/external/tf_toolchains/toolchains/remote_config/rbe_config.bzl:158:27: in _tensorflow_local_config
Repository rule local_python_configure defined at:
  /root/.cache/bazel/_bazel_root/e69e42dd9f08c8f44fd8644c44ecd3fd/external/org_tensorflow/third_party/py/python_configure.bzl:275:41: in <toplevel>
ERROR: An error occurred during the fetch of repository 'local_execution_config_python':
   Traceback (most recent call last):
	File "/root/.cache/bazel/_bazel_root/e69e42dd9f08c8f44fd8644c44ecd3fd/external/org_tensorflow/third_party/py/python_configure.bzl", line 213, column 39, in _create_local_python_repository
		numpy_include = _get_numpy_include(repository_ctx, python_bin) + "/numpy"
	File "/root/.cache/bazel/_bazel_root/e69e42dd9f08c8f44fd8644c44ecd3fd/external/org_tensorflow/third_party/py/python_configure.bzl", line 187, column 19, in _get_numpy_include
		return execute(
	File "/root/.cache/bazel/_bazel_root/e69e42dd9f08c8f44fd8644c44ecd3fd/external/org_tensorflow/third_party/remote_config/common.bzl", line 219, column 13, in execute
		fail(
Error in fail: Problem getting numpy include path.
OpenBLAS WARNING - could not determine the L2 cache size on this system, assuming 256k
Is numpy installed?
ERROR: /notebooks/yggdrasil-decision-forests/WORKSPACE:38:4: fetching local_python_configure rule //external:local_execution_config_python: Traceback (most recent call last):
	File "/root/.cache/bazel/_bazel_root/e69e42dd9f08c8f44fd8644c44ecd3fd/external/org_tensorflow/third_party/py/python_configure.bzl", line 213, column 39, in _create_local_python_repository
		numpy_include = _get_numpy_include(repository_ctx, python_bin) + "/numpy"
	File "/root/.cache/bazel/_bazel_root/e69e42dd9f08c8f44fd8644c44ecd3fd/external/org_tensorflow/third_party/py/python_configure.bzl", line 187, column 19, in _get_numpy_include
		return execute(
	File "/root/.cache/bazel/_bazel_root/e69e42dd9f08c8f44fd8644c44ecd3fd/external/org_tensorflow/third_party/remote_config/common.bzl", line 219, column 13, in execute
		fail(
Error in fail: Problem getting numpy include path.
OpenBLAS WARNING - could not determine the L2 cache size on this system, assuming 256k
Is numpy installed?
INFO: Repository go_sdk instantiated at:
  /notebooks/yggdrasil-decision-forests/WORKSPACE:42:4: in <toplevel>
  /root/.cache/bazel/_bazel_root/e69e42dd9f08c8f44fd8644c44ecd3fd/external/org_tensorflow/tensorflow/workspace0.bzl:117:20: in workspace
  /root/.cache/bazel/_bazel_root/e69e42dd9f08c8f44fd8644c44ecd3fd/external/com_github_grpc_grpc/bazel/grpc_extra_deps.bzl:36:27: in grpc_extra_deps
  /root/.cache/bazel/_bazel_root/e69e42dd9f08c8f44fd8644c44ecd3fd/external/io_bazel_rules_go/go/toolchain/toolchains.bzl:379:28: in go_register_toolchains
  /root/.cache/bazel/_bazel_root/e69e42dd9f08c8f44fd8644c44ecd3fd/external/io_bazel_rules_go/go/private/sdk.bzl:65:21: in go_download_sdk
Repository rule _go_download_sdk defined at:
  /root/.cache/bazel/_bazel_root/e69e42dd9f08c8f44fd8644c44ecd3fd/external/io_bazel_rules_go/go/private/sdk.bzl:53:35: in <toplevel>
ERROR: Analysis of target '//yggdrasil_decision_forests/cli:all_file_systems' failed; build aborted: Problem getting numpy include path.
OpenBLAS WARNING - could not determine the L2 cache size on this system, assuming 256k
Is numpy installed?
INFO: Elapsed time: 49.501s
INFO: 0 processes.
FAILED: Build did NOT complete successfully (11 packages loaded, 15 targets configured)
    currently loading: @bazel_tools//tools/python ... (2 packages)
    Fetching https://dl.google.com/go/go1.12.5.linux-amd64.tar.gz; 1,613,824B

My Dockerfile (to reproduce the Bazel error):

# image is based on:
# https://github.com/tensorflow/tensorflow/blob/master/tensorflow/tools/dockerfiles/dockerfiles/cpu.Dockerfile

ARG UBUNTU_VERSION=20.04

FROM ubuntu:${UBUNTU_VERSION} as base

ENV DEBIAN_FRONTEND=noninteractive

RUN apt-get update && apt-get install -y curl

# See http://bugs.python.org/issue19846
ENV LANG C.UTF-8

RUN apt-get update && apt-get install -y \
    python3 \
    python3-pip

RUN python3 -m pip --no-cache-dir install --upgrade \
    "pip<20.3" \
    setuptools

# Some TF tools expect a "python" binary
RUN ln -s $(which python3) /usr/local/bin/python

# Options:
#   tensorflow
#   tensorflow-gpu
#   tf-nightly
#   tf-nightly-gpu
# Set --build-arg TF_PACKAGE_VERSION=1.11.0rc0 to install a specific version.
# Installs the latest version by default.
ARG TF_PACKAGE=tensorflow
ARG TF_PACKAGE_VERSION=
RUN python3 -m pip install --no-cache-dir ${TF_PACKAGE}${TF_PACKAGE_VERSION:+==${TF_PACKAGE_VERSION}}

# install tensorflow_decision_forests and numpy
RUN pip3 install tensorflow_decision_forests --upgrade
RUN python3 -m pip install numpy

# install bazel
RUN apt install apt-transport-https curl gnupg -y
RUN curl -fsSL https://bazel.build/bazel-release.pub.gpg | gpg --dearmor > bazel.gpg
RUN mv bazel.gpg /etc/apt/trusted.gpg.d/
RUN echo "deb [arch=amd64] https://storage.googleapis.com/bazel-apt stable jdk1.8" | tee /etc/apt/sources.list.d/bazel.list
RUN apt update && apt install bazel -y
RUN apt update && apt full-upgrade -y
RUN apt install bazel-1.0.0 -y
#RUN ln -s /usr/bin/bazel-1.0.0 /usr/bin/bazel
RUN bazel --version

# WORKDIR /tf
# VOLUME ["/tf"]

COPY bashrc /etc/bash.bashrc
RUN chmod a+rwx /etc/bash.bashrc

cc @achoum

Go model serving does not support DISCRETIZED_NUMERICAL

Using a column guide of:

column_guides {
  type: DISCRETIZED_NUMERICAL
  column_name_pattern: "^dialtimebin$"
}

Results in the Go code issuing:

Error: Non supported feature dialtimebin with type DISCRETIZED_NUMERICAL

Is there a chance that this feature type will be supported from Go?

This field is an hourly time bin (0-23). Is that the correct feature type to use?
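As a rough illustration of what discretizing such a column amounts to, here is a minimal sketch. It assumes that discretization maps each value to the index of the interval between precomputed boundaries; the helper names and the midpoint boundaries are made up for illustration and are not taken from the library.

```python
from bisect import bisect_right


def build_boundaries(values):
    """Place one boundary halfway between each pair of distinct values."""
    distinct = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]


def discretize(value, boundaries):
    """Return the index of the bin that contains `value`."""
    return bisect_right(boundaries, value)


hours = list(range(24))  # the hourly time bins 0..23 from the question
boundaries = build_boundaries(hours)
bins = [discretize(h, boundaries) for h in hours]
print(len(boundaries), bins[:5])
```

Under that assumption, a column with only 24 distinct integer values ends up with one bin per hour, so the discretized column carries the same information as the raw numerical values.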

Converting from Scikit-Learn or ONNX models

We have trained tree ensemble models using Scikit-Learn (converted to ONNX) and XGBoost.
Is there a way to load the model into yggdrasil so that we can leverage the fast serving engine for inference?

Not able to use yggdrasil-decision-forests as a dependency through Bazel

I am trying to use yggdrasil-decision-forests as a C++ dependency in another project that uses Bazel. I am following the suggestions in the documentation, mainly:

cc_library(
    name = "models",
    srcs = ["models.cpp"],
    hdrs = ["models.h"],
    deps = [
        "@ydf//yggdrasil_decision_forests/model:all_models",
        "@ydf//yggdrasil_decision_forests/learners:all_learners",
    ]
)

In the BUILD file and:

http_archive(
    name = "ydf",
    strip_prefix = "yggdrasil_decision_forests-master",
    urls = ["https://github.com/google/yggdrasil_decision_forests/archive/master.zip"],
)

load("@ydf//yggdrasil_decision_forests:library.bzl", ydf_load_deps = "load_dependencies")
ydf_load_deps(repo_name = "@ydf")

In the WORKSPACE file.

Nonetheless, I get the following error when running bazel build models:

ERROR: /home/jfilipe/Repos/vvc-early-term-models/deploy/WORKSPACE:9:1: name 'http_archive' is not defined
ERROR: error loading package '': Encountered error while reading extension file 'yggdrasil_decision_forests/library.bzl': no such package '@ydf//yggdrasil_decision_forests': error loading package 'external': Could not load //external package
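The first error is what Bazel reports when a WORKSPACE file calls http_archive without first loading it from @bazel_tools//tools/build_defs/repo:http.bzl. The small checker below is only a sketch (the function name is hypothetical and the regular expressions are deliberately simple): it flags a WORKSPACE text that uses http_archive without the corresponding load statement.

```python
import re

# The load statement that makes http_archive available in a WORKSPACE file.
HTTP_ARCHIVE_LOAD = (
    'load("@bazel_tools//tools/build_defs/repo:http.bzl", "http_archive")'
)


def needs_http_archive_load(workspace_text):
    """True if http_archive is called but never loaded."""
    uses = re.search(r"^\s*http_archive\s*\(", workspace_text, re.MULTILINE)
    loads = re.search(
        r'load\(\s*"@bazel_tools//tools/build_defs/repo:http\.bzl"'
        r'[^)]*"http_archive"',
        workspace_text,
    )
    return uses is not None and loads is None


# WORKSPACE content as quoted in the post.
workspace = '''http_archive(
    name = "ydf",
    strip_prefix = "yggdrasil_decision_forests-master",
    urls = ["https://github.com/google/yggdrasil_decision_forests/archive/master.zip"],
)
'''

print(needs_http_archive_load(workspace))
print(needs_http_archive_load(HTTP_ARCHIVE_LOAD + "\n" + workspace))
```

Adding the load line at the top of the WORKSPACE should clear the "name 'http_archive' is not defined" error; whether the second error disappears with it is not something this check can tell.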

Go module buried in this repo prevents the module importing properly

I had the same issues when I kept the Go runtime for ANTLR in a buried subdirectory of its repo. This prevents the go get command from dealing nicely with the package because:

go get downloads the whole repo, including all the non-Go content, into the package cache.

If you tag the repo with a release tag, go get will not resolve it; it makes up a version name instead of using the release tag. So tagging at, say, 1.0.0 will not show 1.0.0 in the go.mod file of a project using the module. It will show something like:

require (
	github.com/google/yggdrasil-decision-forests/yggdrasil_decision_forests/port/go v0.0.0-20230710100126-8f25eced9d1b
)

This means it is impossible to know at a glance which version of the code is being used. Also, the manufactured tag will change if some other part of the repo is updated.

Go really expects the code to be in its own repo, with the code and the go.mod file at the top level. That was the only way I could make everything work with the Go tooling.
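The made-up version in the require block above is a Go pseudo-version: a base version, a UTC commit timestamp, and a 12-character commit hash prefix. The sketch below decodes the basic form only (the variants built on top of an earlier tag or pre-release are not handled), and the function name is illustrative.

```python
import re
from datetime import datetime, timezone

# Basic pseudo-version form: vX.Y.Z-yyyymmddhhmmss-abcdefabcdef
PSEUDO_VERSION = re.compile(r"^(v\d+\.\d+\.\d+)-(\d{14})-([0-9a-f]{12})$")


def parse_pseudo_version(version):
    """Return (base, commit_time, commit_prefix), or None for other versions."""
    match = PSEUDO_VERSION.match(version)
    if match is None:
        return None
    base, stamp, commit = match.groups()
    commit_time = datetime.strptime(stamp, "%Y%m%d%H%M%S").replace(
        tzinfo=timezone.utc
    )
    return base, commit_time, commit


parsed = parse_pseudo_version("v0.0.0-20230710100126-8f25eced9d1b")
print(parsed)
```

So the version does identify a commit, but only by timestamp and hash, which is the readability problem described above.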

platforms dependency

Hi,
I was following the installation instructions, but the build failed with this error:

 ERROR: .../external/com_google_absl/absl/BUILD.bazel:84:15: no such target '@platforms//cpu:wasm32': target 'wasm32' not declared in package 'cpu' defined by .../external/platforms/cpu/BUILD and referenced by '@com_google_absl//absl:platforms_wasm32'
ERROR: While resolving configuration keys for @com_google_absl//absl:wasm_3: Analysis failed
ERROR: While resolving configuration keys for @com_google_absl//absl/synchronization:synchronization: Analysis failed
ERROR: Analysis of target '//yggdrasil_decision_forests/cli:cli_test' failed; build aborted: Analysis failed

It was resolved by adding this to yggdrasil/yggdrasil-decision-forests/third_party/absl/workspace.bzl:

http_archive(
        name = "platforms",
        sha256 = "b601beaf841244de5c5a50d2b2eddd34839788000fa1be4260ce6603ca0d8eb7",
        strip_prefix = "platforms-98939346da932eef0b54cf808622f5bb0928f00b",
        urls = ["https://github.com/bazelbuild/platforms/archive/98939346da932eef0b54cf808622f5bb0928f00b.zip"],
    )

Ubuntu 18.04, Bazel 4.0.0
bazel build //yggdrasil_decision_forests/cli/...:all --config=linux_cpp17 --config=linux_avx2 --repo_env=CC=gcc-9 --copt=-mavx2 --config=use_tensorflow_io --define=no_absl_statusor=1

Is this because of some dependency change, or have I done something wrong?

Is hessian gain supported or not? on what loss?

Yggdrasil is such a nice project, with versatile support for GBT.
One intriguing point is that Yggdrasil offers use_hessian_gain as a toggle between different training styles.

I'm currently looking into implementing my own loss function for certain tasks, but questions arise when dealing with the hessian.

  • Is a hessian essential for a newly implemented loss function?
  • Does the use_hessian_gain behavior actually default to what is set in the loss, or does it just come from use_hessian_gain=false in the proto?
  • If wanted, how do I properly support hessian training?
    • Is Yggdrasil training on gain = G^2/(H+λ)?
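For reference while reading the questions above, here is a minimal sketch of the per-example gradient and hessian for two common losses, and of the Newton-style leaf value -G/(H+λ) that goes with a G^2/(H+λ) gain. This is textbook second-order boosting, not Yggdrasil's implementation; the function names are illustrative, and a library may store the negated gradient instead.

```python
import math


def squared_error_grad_hess(label, prediction):
    """Loss 0.5 * (prediction - label)^2: the hessian is constant."""
    return prediction - label, 1.0


def binomial_log_loss_grad_hess(label, raw_score):
    """Binary log loss on a raw score: the hessian depends on the prediction."""
    p = 1.0 / (1.0 + math.exp(-raw_score))
    return p - label, p * (1.0 - p)


def newton_leaf_value(grads, hessians, reg_lambda=0.0):
    """Leaf value -G / (H + lambda) from summed gradients and hessians."""
    return -sum(grads) / (sum(hessians) + reg_lambda)


# Squared error: with hessian 1 everywhere, the leaf is the mean residual.
mse = [squared_error_grad_hess(y, 0.0) for y in (1.0, 2.0, 3.0)]
mse_leaf = newton_leaf_value([g for g, _ in mse], [h for _, h in mse])

# Log loss on the labels 0, 0, 1, 0 with an initial raw score of 0.
log = [binomial_log_loss_grad_hess(y, 0.0) for y in (0, 0, 1, 0)]
log_leaf = newton_leaf_value([g for g, _ in log], [h for _, h in log])

print(mse_leaf, log_leaf)
```

With squared error the hessian is 1 for every example, so the hessian-based leaf value coincides with the plain mean of the residuals, while for the log loss the hessian actually changes the result.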

S: use_hessian_gain "Available for all losses except regression."

#### [use_hessian_gain](../yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.proto?q=symbol:use_hessian_gain)
- **Type:** Categorical **Default:** false **Possible values:** true, false
- Use true, uses a formulation of split gain with a hessian term i.e.
optimizes the splits to minimize the variance of "gradient / hessian.
Available for all losses except regression.

A: hessian available for LogLikelihood, not MSE

A.1 . Error when use_hessian_gain = true and hessian_col_idx not set

if (config.gbt_config->use_hessian_gain() &&
    gradients.front().hessian_col_idx == -1) {
  return absl::InvalidArgumentError(
      "Loss does not support hessian optimization");
}

A.2 gradient.hessian_col_idx set iff hessian = true

if (loss_shape.has_hessian) {
  const auto hessian_col_name = HessianColumnName(gradient_idx);
  dataset::proto::Column hessian_col_spec;
  hessian_col_spec.set_name(hessian_col_name);
  hessian_col_spec.set_type(dataset::proto::ColumnType::NUMERICAL);
  // Note: These values will be set correctly before use.
  gradient.hessian =
      dynamic_cast<dataset::VerticalDataset::NumericalColumn*>(
          gradient_dataset->AddColumn(hessian_col_spec).value())
          ->mutable_values();
  gradient.hessian_col_idx =
      gradient_dataset->ColumnNameToColumnIdx(hessian_col_name);

A.3 has hessian = false for MSE

class MeanSquaredErrorLoss : public AbstractLoss {
 public:
  MeanSquaredErrorLoss(
      const proto::GradientBoostedTreesTrainingConfig& gbt_config,
      model::proto::Task task, const dataset::proto::Column& label_column);
  absl::Status Status() const override;
  LossShape Shape() const override {
    return LossShape{/*.gradient_dim =*/1, /*.prediction_dim =*/1,
                     /*.has_hessian =*/false};
  };

A.4 has hessian = use_hessian_gain for LogLikelihoodLoss

class BinomialLogLikelihoodLoss : public AbstractLoss {
 public:
  BinomialLogLikelihoodLoss(
      const proto::GradientBoostedTreesTrainingConfig& gbt_config,
      model::proto::Task task, const dataset::proto::Column& label_column);
  absl::Status Status() const override;
  LossShape Shape() const override {
    return LossShape{/*.gradient_dim =*/1,
                     /*.prediction_dim =*/1,
                     /*.has_hessian =*/gbt_config_.use_hessian_gain()};
  };

B hessian available for Regression, not Classification

B.1 Error when use_hessian_gain on Task::CLASSIFICATION

case model::proto::Task::CLASSIFICATION: {
  if (internal_config.use_hessian_gain) {
    return absl::InternalError("Expect use_hessian_gain=false");
  }

B.2 No error when use_hessian_gain on Task::REGRESSION

case model::proto::Task::REGRESSION: {
  if (internal_config.use_hessian_gain) {
    RegressionHessianLabelStats label_stat(
        train_dataset

C hessian only trainable on Regression task with LogLikelihood Loss

C.1: G, H, W (sums of gradients, hessians, and weights) available for LogLikelihoodLoss, nothing else

if (gbt_config_.use_hessian_gain()) {
  auto* reg = node->mutable_node()->mutable_regressor();
  reg->set_sum_gradients(numerator);
  reg->set_sum_hessians(denominator);
  reg->set_sum_weights(sum_weights);

C.2: G, H, W only accessed on Task::REGRESSION, nothing else

case model::proto::Task::REGRESSION: {
  if (internal_config.use_hessian_gain) {
    RegressionHessianLabelStats label_stat(
        train_dataset
            .ColumnWithCast<dataset::VerticalDataset::NumericalColumn>(
                config_link.label())
            ->values(),
        train_dataset
            .ColumnWithCast<dataset::VerticalDataset::NumericalColumn>(
                internal_config.hessian_col_idx)
            ->values());
    label_stat.sum_gradient = parent.regressor().sum_gradients();
    label_stat.sum_hessian = parent.regressor().sum_hessians();
    label_stat.sum_weights = parent.regressor().sum_weights();

D: MSE = Regression+Ranking, LogLikelihood = Classification

D.1: MSE only available for Regression or Ranking, not Classification

absl::Status MeanSquaredErrorLoss::Status() const {
  if (task_ != model::proto::Task::REGRESSION &&
      task_ != model::proto::Task::RANKING) {
    return absl::InvalidArgumentError(
        "Mean squared error loss is only compatible with a "
        "regression or ranking task");
  }

D.2 LogLikelihood Loss only available for Classification

absl::Status BinomialLogLikelihoodLoss::Status() const {
  if (task_ != model::proto::Task::CLASSIFICATION)
    return absl::InvalidArgumentError(
        "Binomial log likelihood loss is only compatible with a "
        "classification task");

Conclusions drawn from S, A, B, C, D:

S, B => conflict
Either use_hessian_gain is documented incorrectly, or the feature is not implemented yet, or something is really wrong here.

C, D => hessian not trainable
As the LogLikelihood loss does not make much sense for regression tasks, it is likely that C is invalid.

A, D => E: hessian available for Classification, not Regression/Ranking
B, E => conflict
As B also conflicts with S in the first point, it is likely that B is the false one.

Don't pollute my home !!

Hi.

I compiled Yggdrasil from source and launched the beginner example.

It created a folder yggdrasil_decision_forests_beginner in my $HOME.

I guess this is a byproduct of the docker approach to builds. Ideally, what is built and tested in a folder dedicated to the project should stay in that project and not escape it.

(I mean, the bazel build is so cool, so local, so "hermetic"... That just kills the magic...)

How to compile examples out of bazel

I am working on integrating yggdrasil-decision-forests in a C++ application, which is compiled via CMake.

As a first step, I am trying to compile the full example referenced at https://ydf.readthedocs.io/en/latest/cpp_serving.html, and I get the following error:

➜  yggdrasil-decision-forests git:(main) ✗ g++ examples/beginner.cc --std=c++17 -I./
In file included from examples/beginner.cc:46:
./yggdrasil_decision_forests/dataset/data_spec.h:32:10: fatal error: 'yggdrasil_decision_forests/dataset/data_spec.pb.h' file not found
#include "yggdrasil_decision_forests/dataset/data_spec.pb.h"
         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1 error generated.

How should I deal with the missing pb.h file?
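Files ending in .pb.h are not checked into a repository; the protocol buffer compiler (protoc) generates them from the matching .proto files, and Bazel normally runs that step automatically. The sketch below only assembles such a protoc command line. The helper name is hypothetical, and the .proto path is assumed to sit next to the missing header.

```python
import shlex


def protoc_command(proto_files, proto_root=".", out_dir="."):
    """Build a protoc invocation that emits C++ sources (.pb.h / .pb.cc)."""
    return [
        "protoc",
        f"--proto_path={proto_root}",  # root used to resolve proto imports
        f"--cpp_out={out_dir}",        # where the generated C++ files go
        *proto_files,
    ]


# Assumed location: data_spec.pb.h is generated from data_spec.proto.
command = protoc_command(
    ["yggdrasil_decision_forests/dataset/data_spec.proto"]
)
print(shlex.join(command))
```

Generated headers are only one of the missing pieces: a build outside Bazel would also need the include paths and libraries of the project's dependencies (Protocol Buffers itself, Abseil, and so on), so this covers only the step behind the error shown above.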

bazel build error

Hi,

I tried to build Yggdrasil from the latest main branch and ran into an error. Any idea how to get it to build?

root@641936fb6db1:~/yggdrasil-decision-forests# bazel build //yggdrasil_decision_forests/cli:all --config=linux_cpp17 --config=linux_avx2                              
Extracting Bazel installation...
Starting local Bazel server and connecting to it...
INFO: Options provided by the client:
  Inherited 'common' options: --isatty=1 --terminal_columns=168
INFO: Reading rc options for 'build' from /home/developer/yggdrasil-decision-forests/.bazelrc:
  Inherited 'common' options: --experimental_repo_remote_exec
INFO: Reading rc options for 'build' from /home/developer/yggdrasil-decision-forests/.bazelrc:
  'build' options: -c opt --spawn_strategy=standalone --announce_rc --noincompatible_strict_action_env --define=use_fast_cpp_protos=true --define=allow_oversize_protos=true --define=grpc_no_ares=true --color=yes
INFO: Found applicable config definition build:linux_cpp17 in file /home/developer/yggdrasil-decision-forests/.bazelrc: --cxxopt=-std=c++17 --host_cxxopt=-std=c++17 --config=linux
INFO: Found applicable config definition build:linux in file /home/developer/yggdrasil-decision-forests/.bazelrc: --copt=-fdiagnostics-color=always --copt=-w --host_copt=-w
INFO: Found applicable config definition build:linux_avx2 in file /home/developer/yggdrasil-decision-forests/.bazelrc: --copt=-mavx2
WARNING: Download from https://storage.googleapis.com/mirror.tensorflow.org/github.com/tensorflow/runtime/archive/093ed77f7d50f75b376f40a71ea86e08cedb8b80.tar.gz failed: class java.io.FileNotFoundException GET returned 404 Not Found
INFO: Repository llvm-raw instantiated at:
  /home/developer/yggdrasil-decision-forests/WORKSPACE:40:4: in <toplevel>
  /home/developer/.cache/bazel/_bazel_root/24e8bb26857d60bf6cc54958294ec961/external/org_tensorflow/tensorflow/workspace3.bzl:42:9: in workspace
  /home/developer/.cache/bazel/_bazel_root/24e8bb26857d60bf6cc54958294ec961/external/org_tensorflow/third_party/llvm/workspace.bzl:10:20: in repo
  /home/developer/.cache/bazel/_bazel_root/24e8bb26857d60bf6cc54958294ec961/external/org_tensorflow/third_party/repo.bzl:128:21: in tf_http_archive
Repository rule _tf_http_archive defined at:
  /home/developer/.cache/bazel/_bazel_root/24e8bb26857d60bf6cc54958294ec961/external/org_tensorflow/third_party/repo.bzl:81:35: in <toplevel>
WARNING: Download from https://storage.googleapis.com/mirror.tensorflow.org/github.com/llvm/llvm-project/archive/1cb299165c859533e22f2ed05eb2abd5071544df.tar.gz failed: class java.io.FileNotFoundException GET returned 404 Not Found
WARNING: Download from https://github.com/llvm/llvm-project/archive/1cb299165c859533e22f2ed05eb2abd5071544df.tar.gz failed: class java.io.IOException Read timed out
ERROR: An error occurred during the fetch of repository 'llvm-raw':
   Traceback (most recent call last):
        File "/home/developer/.cache/bazel/_bazel_root/24e8bb26857d60bf6cc54958294ec961/external/org_tensorflow/third_party/repo.bzl", line 64, column 33, in _tf_http_archive_impl
                ctx.download_and_extract(
Error in download_and_extract: java.io.IOException: Error downloading [https://storage.googleapis.com/mirror.tensorflow.org/github.com/llvm/llvm-project/archive/1cb299165c859533e22f2ed05eb2abd5071544df.tar.gz, https://github.com/llvm/llvm-project/archive/1cb299165c859533e22f2ed05eb2abd5071544df.tar.gz] to /home/developer/.cache/bazel/_bazel_root/24e8bb26857d60bf6cc54958294ec961/external/llvm-raw/temp10191920695629813942/1cb299165c859533e22f2ed05eb2abd5071544df.tar.gz: Read timed out
ERROR: /home/developer/yggdrasil-decision-forests/WORKSPACE:40:4: fetching _tf_http_archive rule //external:llvm-raw: Traceback (most recent call last):
        File "/home/developer/.cache/bazel/_bazel_root/24e8bb26857d60bf6cc54958294ec961/external/org_tensorflow/third_party/repo.bzl", line 64, column 33, in _tf_http_archive_impl
                ctx.download_and_extract(
Error in download_and_extract: java.io.IOException: Error downloading [https://storage.googleapis.com/mirror.tensorflow.org/github.com/llvm/llvm-project/archive/1cb299165c859533e22f2ed05eb2abd5071544df.tar.gz, https://github.com/llvm/llvm-project/archive/1cb299165c859533e22f2ed05eb2abd5071544df.tar.gz] to /home/developer/.cache/bazel/_bazel_root/24e8bb26857d60bf6cc54958294ec961/external/llvm-raw/temp10191920695629813942/1cb299165c859533e22f2ed05eb2abd5071544df.tar.gz: Read timed out
ERROR: no such package '@llvm-raw//utils/bazel': java.io.IOException: Error downloading [https://storage.googleapis.com/mirror.tensorflow.org/github.com/llvm/llvm-project/archive/1cb299165c859533e22f2ed05eb2abd5071544df.tar.gz, https://github.com/llvm/llvm-project/archive/1cb299165c859533e22f2ed05eb2abd5071544df.tar.gz] to /home/developer/.cache/bazel/_bazel_root/24e8bb26857d60bf6cc54958294ec961/external/llvm-raw/temp10191920695629813942/1cb299165c859533e22f2ed05eb2abd5071544df.tar.gz: Read timed out
INFO: Elapsed time: 182.660s
INFO: 0 processes.
FAILED: Build did NOT complete successfully (0 packages loaded)

It looks like it is related to tensorflow/tensorflow#56422.

Compiling on Mac

Hi all,

I am currently experimenting with Yggdrasil and trying to compile it on macOS Big Sur. On Linux it runs without problems.

My setup:

  • Apple clang version 13.0.0 (clang-1300.0.29.3) or gcc/g++ (Homebrew GCC 9.4.0) 9.4.0
  • bazel 4.0.0 (also tried bazelisk)
  • Python 3.9.5
  • numpy Version: 1.21.3

I tried different flags from the .bazelrc for the macos and linux configs.

Whatever I do, compilation fails with the following error:

yggdrasil_decision_forests/utils/bitmap.cc:198:12: error: out-of-line definition of 'BitWriter' does not match any declaration in 'yggdrasil_decision_forests::utils::bitmap::BitWriter'
BitWriter::BitWriter(const uint64_t size, std::string* bitmap)

Other branches have different versions of the BitWriter class. The ones not in main add ~BitWriter(); to the public section of the BitWriter class in bitmap.h.
I added this but got the same result.

So can you tell me what I am doing wrong? Is there a way to get it running on Mac?

Thanks so much for your response.

Best regards,

Timo

Edit 1:

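We seem to have found the error.

bitmap.h declares the constructor as:

BitWriter(size_t size, std::string* bitmap);

but bitmap.cc defines it with a different parameter type:

BitWriter::BitWriter(const uint64_t size, std::string* bitmap)

On macOS, size_t is unsigned long while uint64_t is unsigned long long, so the two signatures do not match; on 64-bit Linux both are unsigned long, which is why the build passes there. After changing uint64_t size to size_t size in bitmap.cc, compiling on macOS worked.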

performance issue with random forests

Hi,
I'm running a RANDOM_FOREST model trained in tf_df through the Yggdrasil C++ API, and inference takes about 50 μs. But as you said, it probably shouldn't take more than 10 μs.

Running in large batches (vs. batch size = 1) or using --copt=-mavx2 makes no difference at all!
I've used the benchmark_inference tool and the result was the same.
Another interesting observation is the difference between the min and max execution time per instance.
Execution times for 10 runs:
########################################
0 max 2133059 min 17293 avg 52250
1 max 1054634 min 16696 avg 52982
2 max 1038110 min 14949 avg 45468
3 max 1068611 min 16752 avg 53064
4 max 1657415 min 16790 avg 54514
5 max 1125537 min 16432 avg 53145
6 max 1939590 min 17591 avg 74354
7 max 2997816 min 17284 avg 70325
8 max 1064365 min 16554 avg 56063
9 max 1044182 min 16145 avg 51429
########################################

Even if I ignore some of the first execution times (to exclude cache misses), the variance between execution times is still high.
########################################
0 max 1488318 min 15841 avg 51085
1 max 955384 min 16501 avg 45567
2 max 928377 min 16370 avg 44606
3 max 1018261 min 15124 avg 44204
4 max 1429345 min 17299 avg 79810
5 max 1628887 min 17539 avg 80997
6 max 2126679 min 16487 avg 67346
7 max 1058939 min 16616 avg 53941
8 max 1098449 min 16242 avg 48047
9 max 1103341 min 16659 avg 53750
########################################

I'm wondering whether there is a problem with the model or the inference setup, or whether this is the best performance I can get.
Model spec:
RANDOM_FOREST
300 root(s), 618972 node(s), and 28 input feature(s).
RandomForestOptPred engine

Compiled with these flags:
--config=linux_cpp17 --config=linux_avx2 --repo_env=CC=gcc-9 --copt=-mavx2

System spec:
Ubuntu 18.04.4
CPU: Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz
Running on an ESXi virtual machine

features question

Hi,

How are categorical features implemented in yggdrasil-df: are they kept as strings or converted? Is there anything I need to be aware of when using categorical features compared to numerical features? For example, if I have URLs as a categorical feature with 10k unique strings or more...

From the docs: are they really stored as integers?

CATEGORICAL: A categorical value. Generally for a type/class with a finite set of possible values
without ordering.
For example, the color RED in the set {RED, BLUE, GREEN}. Can be a string or an integer.
Internally, categorical values are stored as int32 and should therefore be smaller than ~2B
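To illustrate what "stored as int32" can mean for string values such as URLs, here is a minimal sketch of dictionary encoding. It is not YDF's implementation: the function names, the min_count parameter, and the choice of index 0 for unseen or rare values are all assumptions made for this example.

```python
from collections import Counter

OUT_OF_VOCAB = 0  # Assumption: index 0 is reserved for unseen or rare values.


def build_vocabulary(values, min_count=1):
    """Map each sufficiently frequent string to a small positive integer."""
    counts = Counter(values)
    # Most frequent values get the smallest indices; ties are broken alphabetically.
    kept = sorted(
        (v for v, c in counts.items() if c >= min_count),
        key=lambda v: (-counts[v], v),
    )
    return {value: index for index, value in enumerate(kept, start=1)}


def encode(values, vocabulary):
    """Replace each string by its integer index, or OUT_OF_VOCAB if unknown."""
    return [vocabulary.get(v, OUT_OF_VOCAB) for v in values]


urls = ["a.com", "b.com", "a.com", "c.com", "a.com", "b.com"]
vocab = build_vocabulary(urls, min_count=2)
print(vocab)
print(encode(["a.com", "c.com", "z.com"], vocab))
```

With an encoding like this, a feature with 10k unique strings costs one dictionary of 10k entries plus one integer per example, so the strings themselves are not repeated for every example.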

Thanks

Cannot install in pip3.11

Hi, I am trying to install tensorflow_decision_forests on my Mac for Python 3.11 and got the following error:

➜  ML-COLGEN-Hackathon-June2023 git:(main) ✗ pip3.11 install tensorflow_decision_forests --upgrade
ERROR: Could not find a version that satisfies the requirement tensorflow_decision_forests (from versions: none)
ERROR: No matching distribution found for tensorflow_decision_forests

I use

➜  yggdrasil-decision-forests git:(main) ✗ pip3.11 --version
pip 23.1.2 from /opt/homebrew/lib/python3.11/site-packages/pip (python 3.11)

Any clue? It works fine with pip3.10.

Issue with use of BINOMIAL_LOG_LIKELIHOOD loss

Hi

I have moved from 0.2.3 to main (latest) and have now found an issue with my code:

[INFO 2023-01-26T12:20:50.093951+00:00 gradient_boosted_trees.cc:452] Default loss set to BINOMIAL_LOG_LIKELIHOOD
[FATAL 2023-01-26T12:20:50.0948034+00:00 YggdrasilWrapper.cpp:177] INVALID_ARGUMENT: No class registered with key "BINOMIAL_LOG_LIKELIHOOD" in the class pool "class yggdrasil_decision_forests::model::gradient_boosted_trees::AbstractLoss". Registered classes are "". Add as a dependency the cc_library rule that defines this class in your BUILD file

I can see that the loss functionality has changed quite a bit since 0.2.3 but I'm not sure what I'm missing to get this to work again.

Thanks!

Saving and loading models trained using quickscorer

Hi

I'm working on saving and loading models to file. I have a GradientBoostedTreesBinaryClassificationQuickScorerExtended model and am using the model_library SaveModel/LoadModel functions to save it to and load it from a directory. When doing this, the library creates a directory with 2 files, a header and a data_spec. I'm not sure whether these contain the whole trained model, but I suspect not: when I load the model (cast it to GradientBoostedTreeModel and create the specialized model GradientBoostedTreesBinaryClassificationQuickScorerExtended), the actual trees seem to be lost, as shown by DescriptionAndStatistics (the metadata is still there, though). The loaded model is also no longer able to predict. Do SaveModel and LoadModel work correctly with these kinds of specialized models, or should I be taking a different approach?

Thanks for your help.

getting protobuf downgrading issue

######################$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$<<< tensorflow model
#Copy and execute the following code in a new Google Colab notebook launch to run the model.
!python -m pip install tensorflow tensorflow_decision_forests -U -qq

# Transfer the model from Google Drive to Colab
from google.colab import drive
drive.mount("/content/gdrive")
!cp "/content/gdrive/My Drive/simple_ml_for_sheets/Ice height from bands" ydf_model
  
# Prepare and load the model with TensorFlow
import tensorflow as tf
import tensorflow_decision_forests as tfdf

tfdf.keras.yggdrasil_model_to_keras_model("ydf_model", "tfdf_model")
model = tf.keras.models.load_model("tfdf_model")
######################$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$<<< tensorflow model

Output from Google Colab:

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 462.5/462.5 KB 18.2 MB/s eta 0:00:00
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow-metadata 1.12.0 requires protobuf<4,>=3.13, but you have protobuf 4.22.1 which is incompatible.
Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-10-257986a1557a> in <module>
     17 # Prepare and load the model with TensorFlow
     18 import tensorflow as tf
---> 19 import tensorflow_decision_forests as tfdf
     20 
     21 tfdf.keras.yggdrasil_model_to_keras_model("ydf_model", "tfdf_model")

8 frames
/usr/local/lib/python3.9/dist-packages/google/protobuf/descriptor.py in __new__(cls, name, index, number, type, options, serialized_options, create_key)
    794                 type=None,  # pylint: disable=redefined-builtin
    795                 options=None, serialized_options=None, create_key=None):
--> 796       _message.Message._CheckCalledFromGeneratedFile()
    797       # There is no way we can build a complete EnumValueDescriptor with the
    798       # given parameters (the name of the Enum is not known, for example).

TypeError: Descriptors cannot not be created directly.
If this call came from a _pb2.py file, your generated code is out of date and must be regenerated with protoc >= 3.19.0.
If you cannot immediately regenerate your protos, some other possible workarounds are:
 1. Downgrade the protobuf package to 3.20.x or lower.
 2. Set PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python (but this will use pure-Python parsing and will be much slower).

More information: https://developers.google.com/protocol-buffers/docs/news/2022-05-06#python-updates
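The second workaround listed in the error message can be applied from Python, as long as it runs before anything imports protobuf (in Colab, that means the first cell after a runtime restart). The sketch below is an assumption-laden illustration, not a confirmed fix for this issue: the helper names are hypothetical, and only the environment variable name and the version constraint come from the output above.

```python
import os
from importlib import metadata


def use_pure_python_protobuf():
    """Apply workaround 2 from the error message.

    This must run before protobuf (or tensorflow) is imported, and the
    message warns that pure-Python parsing is much slower.
    """
    os.environ["PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"] = "python"


def installed_protobuf_version():
    """Return the installed protobuf version string, or None if absent."""
    try:
        return metadata.version("protobuf")
    except metadata.PackageNotFoundError:
        return None


def major_version(version):
    """Extract the major version number from a string such as '4.22.1'."""
    return int(version.split(".")[0])


use_pure_python_protobuf()
version = installed_protobuf_version()
if version is None:
    print("protobuf is not installed")
elif major_version(version) >= 4:
    # tensorflow-metadata 1.12.0 asks for protobuf<4,>=3.13 in the log above.
    print(f"protobuf {version} is installed; the log above asks for protobuf<4")
else:
    print(f"protobuf {version} satisfies the protobuf<4 constraint")
```

The first workaround from the message (downgrading protobuf to 3.20.x or lower) avoids the slowdown, at the cost of pinning a package that other libraries may also depend on.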

Serialization of Yggdrasil models

Hi guys

I have come across a couple of methods in the library to LoadModel and SaveModel but I haven't seen any interface provided for serialization of a model to string. I can save to a directory and serialize from there but is there a simpler way to do this?

Many thanks.

Windows Build Fails - Compiling .cc files results in syntax error

My build on Windows 11 w/VC 2019 is failing with the errors below when compiling various .cc files (convert_dataset.cc, dataset_cache_reader.cc, distribute.cc etc)

external/com_google_absl\absl/crc/internal/crc32_x86_arm_combined_simd.h(64): error C2061: syntax error: identifier '__m128i_u'
external/com_google_absl\absl/crc/internal/crc32_x86_arm_combined_simd.h(81): error C4430: missing type specifier - int assumed. Note: C++ does not support default-int
external/com_google_absl\absl/crc/internal/crc32_x86_arm_combined_simd.h(81): error C2143: syntax error: missing ',' before '*'
external/com_google_absl\absl/crc/internal/crc32_x86_arm_combined_simd.h(149): error C4430: missing type specifier - int assumed. Note: C++ does not support default-int
external/com_google_absl\absl/crc/internal/crc32_x86_arm_combined_simd.h(149): error C2143: syntax error: missing ',' before '*'
external/com_google_absl\absl/crc/internal/crc32_x86_arm_combined_simd.h(149): error C2065: 'src': undeclared identifier

I've tried compiling 1.3 and 1.4 with similar results. Any idea how to resolve this crc32_x86_arm_combined_simd.h error?

XGBoost implementation

Hi all,

I'm trying to implement XGBoost using gradient_boosted_trees with use_hessian_gain. After some tweaking of the parameters on several binary classification datasets, I cannot replicate the trees and results of xgboost's XGBClassifier, although as far as I understand the code, it should implement the same algorithm.

Has anybody tried this, or is it even possible? If yes, can you point me to how I need to configure GradientBoostedTreesModel? If not, is an implementation of XGBoost planned?
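For reference when comparing the two libraries, the second-order split gain and leaf weight described in the XGBoost paper can be written down in a few lines. This is a sketch of those published formulas only; it says nothing about how YDF computes its gain internally, and the function and parameter names are chosen for this example.

```python
import math


def logistic_grad_hess(label, raw_score):
    """Gradient and hessian of the log loss with respect to the raw score."""
    p = 1.0 / (1.0 + math.exp(-raw_score))
    return p - label, p * (1.0 - p)


def leaf_weight(grad_sum, hess_sum, reg_lambda=1.0):
    """Optimal leaf value: -G / (H + lambda)."""
    return -grad_sum / (hess_sum + reg_lambda)


def split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0, gamma=0.0):
    """Gain of a split: 0.5 * (left + right - parent) - gamma."""
    def score(g, h):
        return g * g / (h + reg_lambda)

    parent = score(g_left + g_right, h_left + h_right)
    return 0.5 * (score(g_left, h_left) + score(g_right, h_right) - parent) - gamma


# One positive example at raw score 0: gradient -0.5, hessian 0.25.
print(logistic_grad_hess(1, 0.0))
# Two children with opposite gradient sums and equal hessian sums.
print(split_gain(2.0, 1.0, -2.0, 1.0, reg_lambda=1.0, gamma=0.0))
```

Even when two libraries share this gain formula, the resulting trees can still differ, for instance through the L2 regularization value, the initial prediction, the way candidate thresholds are enumerated, or subsampling, so those settings are worth aligning before comparing outputs.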

Thanks so much in advance,

Timo

Cannot import ydf from windows vsc

Hi, I tried to install ydf via pip install ydf.

It shows that I've installed version 0.0.5 of ydf.

When I try to import ydf in the notebook, the following error pops up:

ImportError: cannot import name 'ydf' from 'ydf.cc' (c:\Users**\AppData\Local\Programs\Python\Python311\Lib\site-packages\ydf\cc\__init__.py)

Any idea how I can fix this?

read model information

hi,

I have a model with an /assets folder:

  • 725b7a901c7141c3data_spec.pb
  • 725b7a901c7141c3done
  • 725b7a901c7141c3header.pb
  • 725b7a901c7141c3nodes-00000-of-00001
  • 725b7a901c7141c3random_forest_header.pb

I'm trying to read some information about the generated model:

package main

import (
	"log"

	"github.com/google/yggdrasil-decision-forests/yggdrasil_decision_forests/port/go/model/io"
	rf "github.com/google/yggdrasil-decision-forests/yggdrasil_decision_forests/port/go/model/randomforest"
	"github.com/google/yggdrasil-decision-forests/yggdrasil_decision_forests/port/go/serving"
)

func main() {
	modelPath := "../foo/tf-model/assets"
	model, err := io.LoadModel(modelPath)
	if err != nil {
		log.Fatalf("Cannot load model. %v", err)
	}

	// Check the type assertion instead of panicking on a non-Random-Forest model.
	rfModel, ok := model.(*rf.Model)
	if !ok {
		log.Fatalf("Expected a Random Forest model, got %T", model)
	}

	// Print each precomputed variable importance by name.
	varImportance := rfModel.Header().GetPrecomputedVariableImportances()
	for k, v := range varImportance {
		log.Printf("%s, %s\n", k, v)
	}

	// Compile the model for fast inference.
	// At this point, the "model" object can be discarded.
	engine, err := serving.NewEngine(model)
	if err != nil {
		log.Fatalf("Cannot create serving.NewEngine. %v", err)
	}
	_ = engine // Not used further in this snippet.

	//fmt.Printf("\n%+v\n", engine.Features().NumericalFeatures)
	//fmt.Printf("\n%+v\n", engine.Features().CategoricalFeatures)
	//fmt.Printf("%+v\n", engine.Features().CategoricalSpec)
}

I get this output. How should I interpret the feature importance information? Is there any other useful information, about feature importance or otherwise, that I can read?

2023/04/12 12:38:00 NUM_AS_ROOT, variable_importances:{attribute_idx:10 importance:1}
2023/04/12 12:38:00 MEAN_MIN_DEPTH, variable_importances:{attribute_idx:3 importance:23.4122039822913} variable_importances:{attribute_idx:14 importance:23.4122039822913} variable_importances:{attribute_idx:13 importance:18.959115431559333} variable_importances:{attribute_idx:9 importance:18.90570766197468} variable_importances:{attribute_idx:7 importance:18.404960169864875} variable_importances:{attribute_idx:6 importance:16.509970172537066} variable_importances:{attribute_idx:4 importance:15.318525526674996} variable_importances:{attribute_idx:12 importance:13.434679300607382} variable_importances:{attribute_idx:8 importance:11.116124885348432} variable_importances:{attribute_idx:2 importance:10.598752013173195} variable_importances:{attribute_idx:0 importance:8.497324194911275} variable_importances:{attribute_idx:5 importance:8.410723442363665} variable_importances:{attribute_idx:1 importance:6.79472494457003} variable_importances:{attribute_idx:11 importance:6.022468096170096} variable_importances:{attribute_idx:10 importance:0}
2023/04/12 12:38:00 NUM_NODES, variable_importances:{attribute_idx:7 importance:28824} variable_importances:{attribute_idx:4 importance:24230} variable_importances:{attribute_idx:12 importance:23355} variable_importances:{attribute_idx:9 importance:16055} variable_importances:{attribute_idx:5 importance:15593} variable_importances:{attribute_idx:11 importance:12692} variable_importances:{attribute_idx:10 importance:4472} variable_importances:{attribute_idx:6 importance:3964} variable_importances:{attribute_idx:13 importance:3655} variable_importances:{attribute_idx:8 importance:2105} variable_importances:{attribute_idx:1 importance:1939} variable_importances:{attribute_idx:2 importance:1360} variable_importances:{attribute_idx:0 importance:218}
2023/04/12 12:38:00 SUM_SCORE, variable_importances:{attribute_idx:5 importance:33101.01888473334} variable_importances:{attribute_idx:11 importance:22825.741772663732} variable_importances:{attribute_idx:7 importance:21561.736182725967} variable_importances:{attribute_idx:12 importance:21390.187111173334} variable_importances:{attribute_idx:4 importance:21388.94658304365} variable_importances:{attribute_idx:10 importance:19679.342864750575} variable_importances:{attribute_idx:9 importance:13148.756573859178} variable_importances:{attribute_idx:1 importance:6424.683870845609} variable_importances:{attribute_idx:6 importance:5866.794061057355} variable_importances:{attribute_idx:2 importance:4778.448671128921} variable_importances:{attribute_idx:8 importance:4520.945368736043} variable_importances:{attribute_idx:13 importance:4238.837772123543} variable_importances:{attribute_idx:0 importance:1949.4544305546806}
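The printed lines can be turned into a ranked table with a small parser. The sketch below works on a shortened copy of the NUM_NODES line above (first three entries only) and assumes that attribute_idx refers to the column index in the model's dataspec; the function name and regular expression are written for this example and are not part of the YDF API.

```python
import re

# Matches one "attribute_idx:<int> importance:<number>" pair in the printed text.
ENTRY = re.compile(r"attribute_idx:(\d+)\s+importance:([0-9.eE+-]+)")


def parse_importances(text):
    """Return (attribute_idx, importance) pairs in the order they are printed."""
    return [(int(idx), float(value)) for idx, value in ENTRY.findall(text)]


# First three entries of the NUM_NODES line from the output above.
sample = (
    "variable_importances:{attribute_idx:7 importance:28824} "
    "variable_importances:{attribute_idx:4 importance:24230} "
    "variable_importances:{attribute_idx:12 importance:23355}"
)

pairs = parse_importances(sample)
total = sum(value for _, value in pairs)
for idx, value in sorted(pairs, key=lambda pair: pair[1], reverse=True):
    share = value / total
    print(f"attribute {idx:>2}  {value:>8.0f}  {'#' * round(share * 40)}")
```

A table like this is enough for a bar chart of global importances. The layered violin plot linked below is built from per-example SHAP values, which are a different quantity from the per-feature totals printed here.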

Is there a way to generate a plot like the following from the information above? https://shap.readthedocs.io/en/latest/example_notebooks/tabular_examples/tree_based_models/Scatter%20Density%20vs.%20Violin%20Plot%20Comparison.html#Layered-violin-plot

I also tried to read the HyperparameterOptimizerLogs, but it seems to be empty (I get nil):

fmt.Printf("%+v\n", rfModel.Header().GetHyperparameterOptimizerLogs())

Thanks

Can decision tree based algorithms always find the best fit?

Hi

I have a bit of a general question: when working with decision tree based algorithms such as Yggdrasil's GBTs, should we expect that the algorithm will always find a better fit to the data (less error) when using all the available features, compared to using a subset of them? The reasoning behind this assumption is that a good algorithm will always be able to choose the best features from within a group, however big it is, so it's reasonable to expect that if a subset of features gives the best fit, adding more features will not make the fit worse. It's a question about local versus global minima. Any comments would be much appreciated...
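For a single split, the intuition in the question holds on the training data: the best split over all features is chosen from a superset of the candidates available to any subset, so its training error cannot be higher. The toy sketch below (names and data invented for this example) shows that case with a one-split "stump". It does not extend to error on held-out data, nor to deeper trees grown greedily, where an early locally best split on an extra feature can lead to a different and possibly worse final tree.

```python
def stump_errors(rows, labels, features):
    """Fewest training errors achievable with at most one threshold split.

    Each side of the split predicts its majority label (binary labels 0/1).
    """
    best = min(labels.count(0), labels.count(1))  # Baseline: no split at all.
    for f in features:
        for threshold in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[f] > threshold]
            errors = sum(min(side.count(0), side.count(1)) for side in (left, right))
            best = min(best, errors)
    return best


# Feature 0 is uninformative; feature 1 separates the labels perfectly.
rows = [(1, 0), (2, 0), (1, 1), (2, 1)]
labels = [0, 0, 1, 1]

errors_subset = stump_errors(rows, labels, features=[0])
errors_all = stump_errors(rows, labels, features=[0, 1])
print("errors with feature 0 only:", errors_subset)
print("errors with both features:", errors_all)
```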

Loading big models is slow

Hi,

I am trying to deploy a trained YDF model (tfdf.keras.RandomForestModel) to a serverless FaaS-like inference service in order to save resources (only 2-3 invocations per day).
For this, the model needs to be loaded basically every time a prediction is made. However, loading the model takes more than 30 minutes, which is too long for this use case. I am not a C++ dev, so my debugging efforts have been relatively limited so far. Initially I only used the tfdf library; since then I have switched to the ydf Python library, with no performance increase (which is probably to be expected, since they use the same C++ backend). The only difference is that the ydf library does not seem to actually load the model fully when calling

loaded_model = ydf.from_tensorflow_decision_forests(model_path)

but only when the first predict call is made.

The code

logger.info("predicting...")
live_predictions = loaded_model.predict(live_data)

is producing

2024-03-06 09:03:03.992 | INFO     | __main__:<module>:45 - predicting...
[INFO 24-03-06 09:33:24.2661 EST decision_forest.cc:700] Model loaded with 200 root(s), 2681168 node(s), and 1191 input feature(s).
[INFO 24-03-06 09:33:24.3790 EST abstract_model.cc:1344] Engine "RandomForestGeneric" built
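The delay can be read directly off the two timestamps in the log. The sketch below assumes both timestamps are in the same time zone (only the second one is labelled EST) and copies them verbatim from the output above.

```python
from datetime import datetime

# Timestamp of the "predicting..." line (application logger).
started = datetime.strptime("2024-03-06 09:03:03.992", "%Y-%m-%d %H:%M:%S.%f")
# Timestamp of the "Model loaded with 200 root(s)..." line (two-digit year).
loaded = datetime.strptime("24-03-06 09:33:24.2661", "%y-%m-%d %H:%M:%S.%f")

elapsed_seconds = (loaded - started).total_seconds()
minutes, seconds = divmod(elapsed_seconds, 60)
print(f"{elapsed_seconds:.4f} s = {int(minutes)} min {seconds:.4f} s")
```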

As you can see, it takes ~30 minutes until the model is fully loaded. This happens on my local machine (M2 Air), but also on all other machines I tried (including x86 ones).
If you think that a model with these parameters is supposed to take this long to load, feel free to close this issue, but to me this seems too slow, given that gigantic models like llama can be loaded in under 10 seconds.

Thanks in advance!

Missing whitespace on page /cli_install.html and please use sudo in build_binary_release.sh

Two suggestions to make the build a bit nicer for end users:

In https://ydf.readthedocs.io/en/latest/cli_install.html, please add a space in INSTALL_DEPENDENCIES=1 BUILD=1./tools/build_binary_release.sh to make it INSTALL_DEPENDENCIES=1 BUILD=1 ./tools/build_binary_release.sh. This makes the build instructions a bit smoother.

Then, please use sudo in that script when invoking apt. I do not want to run ./tools/build_binary_release.sh as root.

Error compiling CLI in OSX

Hi, I am trying to compile the CLI on my OSX with the following command

bazel build //yggdrasil_decision_forests/cli:all --config=macos 

but I get the following cryptic error

ERROR: /Users/andreacassioli/workspace/yggdrasil-decision-forests/yggdrasil_decision_forests/cli/BUILD:242:8: Executing genrule //yggdrasil_decision_forests/cli:compile_model_for_test failed: (Aborted): bash failed: error executing command /bin/bash -c ... (remaining 1 argument skipped)
[INFOdyld[64991]: missing symbol called
/bin/bash: line 2: 64991 Abort trap: 6           bazel-out/darwin_arm64-opt-exec-2B5CBBC6/bin/yggdrasil_decision_forests/cli/compile_model --model $TESTDATA_BASEDIR --namespace test_model > bazel-out/darwin_arm64-opt/bin/yggdrasil_decision_forests/cli/generated_model.h

What could be wrong?

Use of designated initializers requires at least '/std:c++20'

Hi

I'm currently trying to build YDF master locally and am running into an unexpected issue:

"error C7555: use of designated initializers requires at least '/std:c++20'"

I say unexpected because, according to what I've read, designated initializers are a C++20 feature. However, YDF seems to offer only C++14 or C++17 build config options (as found in .bazelrc, in test_bazel.bat, and on the YDF installation page). I can try to modify the code to make it compatible with C++17 if that is what's required. However, I was wondering whether adding the following to .bazelrc:

build:windows_cpp20 --cxxopt=/std:c++20
build:windows_cpp20 --host_cxxopt=/std:c++20
build:windows_cpp20 --config=windows

and building with C++20 is an acceptable (as in supported) approach, or whether I will run into other, potentially more serious problems. I have briefly tried this and come across a constinit error, but if this approach is OK I might as well spend the time fixing that error instead of the one above!

Thanks.

Can Yggdrasil run on 8-bit?

Hi

I am using a GBT model with the quickscorer algorithm. For training, I create a yggdrasil_decision_forests::dataset::VerticalDataset and use the AppendExample interface for which I have to convert the features to string (very slow!). Then I train an AbstractLearner using the dataset. For prediction, I cast the abstract model to GradientBoostedTreesBinaryClassificationQuickScorerExtended and create an ExampleSet which I then use to predict. So far I'm using the SetNumerical interface, which takes float, for creating the example set.

Since all my features come as uint8, I was wondering whether it is possible to use the Yggdrasil library (both training and prediction) in 8-bit mode directly, without having to convert either to string or to float? I haven't found interfaces to do this. If they are not available, is there a way to change/template the code to make this possible? Has anyone done this already? Do you envision problems with this?
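On the precision side of this question, converting uint8 features to float loses nothing: every integer from 0 to 255 is exactly representable in a 32-bit float. The sketch below checks this with a round trip through the float32 binary format; the helper name is invented for this example, and the check says nothing about the memory or speed cost of the conversion, which is the other half of the question.

```python
import struct


def survives_float32(value):
    """Return True if an integer is unchanged after a round trip through float32."""
    as_float32 = struct.unpack("<f", struct.pack("<f", float(value)))[0]
    return as_float32 == value


all_uint8_exact = all(survives_float32(v) for v in range(256))
print(all_uint8_exact)  # True

# For contrast: float32 has a 24-bit significand, so 2**24 + 1 is not exact.
print(survives_float32(2**24 + 1))  # False
```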

Many thanks.

Is the model prediction probability calibrated?

I understand that the prediction probabilities of sklearn random forest models are not calibrated, and that extra steps are needed to calibrate them.

I just wanted to understand whether the prediction probabilities of yggdrasil-decision-forests are calibrated.

If not, how can we calibrate them?
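Whatever the library, calibration can be checked empirically on held-out data by comparing predicted probabilities with observed frequencies. The sketch below computes a reliability table and the expected calibration error for a binary classifier; the function names and the toy data are invented for this example, and it makes no claim about how well YDF models are calibrated.

```python
def reliability_bins(probs, labels, num_bins=5):
    """Group predictions into equal-width bins.

    Returns (bin index, count, mean predicted probability, observed positive rate)
    for every non-empty bin.
    """
    bins = [[] for _ in range(num_bins)]
    for p, y in zip(probs, labels):
        index = min(int(p * num_bins), num_bins - 1)  # p == 1.0 goes in the last bin.
        bins[index].append((p, y))
    rows = []
    for index, items in enumerate(bins):
        if items:
            mean_p = sum(p for p, _ in items) / len(items)
            positive_rate = sum(y for _, y in items) / len(items)
            rows.append((index, len(items), mean_p, positive_rate))
    return rows


def expected_calibration_error(probs, labels, num_bins=5):
    """Average gap between confidence and observed frequency, weighted by bin size."""
    n = len(probs)
    return sum(
        count / n * abs(mean_p - positive_rate)
        for _, count, mean_p, positive_rate in reliability_bins(probs, labels, num_bins)
    )


# Toy held-out predictions: slightly under-confident at both ends.
probs = [0.1, 0.1, 0.9, 0.9]
labels = [0, 0, 1, 1]
print(reliability_bins(probs, labels))
print(expected_calibration_error(probs, labels))
```

If the gaps are large, a calibrator such as Platt scaling or isotonic regression can be fitted on a held-out set and applied to the model's output probabilities.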

Cannot compile standalone example on macOS

Hi,

I am interested in using the Yggdrasil library for a C++ project, but I am having issues compiling the dylib on macOS.

In fact, I am not even able to compile the standalone example provided. Given the latest commit (as of Dec 18), I have tried modifying the compile_and_run.sh file so that line 29 reads:

bazel build --config=macos //:beginner_cc

Nevertheless, I get an error related to @org_tensorflow being unresolvable:

(...)
[1 / 1] checking cached actions
    Fetching repository @@bazel_tools~cc_configure_extension~local_config_cc; starting
    Fetching repository @@rules_java; starting
    Fetching https://github.com/boostorg/math/archive/refs/tags/boost-1.83.0.tar.gz
    Fetching https://github.com/protocolbuffers/protobuf/archive/refs/tags/v3.19.6.zip
    Fetching https://github.com/bazelbuild/rules_java/releases/download/6.0.0/rules_java-6.0.0.tar.gz
ERROR: no such package '@@org_tensorflow//tensorflow/core': The repository '@@org_tensorflow' could not be resolved: Repository '@@org_tensorflow' is not defined
Analyzing: target //:beginner_cc (109 packages loaded, 663 targets configured)
    currently loading: @@com_google_protobuf//
ERROR: /private/var/tmp/_bazel_c.cardona/e5bb2b2198d9d78378b014e5c1a97558/external/ydf/yggdrasil_decision_forests/dataset/tensorflow/BUILD:8:15: no such package '@@org_tensorflow//tensorflow/core': The repository '@@org_tensorflow' could not be resolved: Repository '@@org_tensorflow' is not defined and referenced by '@@ydf//yggdrasil_decision_forests/dataset/tensorflow:tensorflow'
ERROR: Analysis of target '//:beginner_cc' failed; build aborted: Analysis failed
INFO: Elapsed time: 58.465s, Critical Path: 0.14s
INFO: 1 process: 1 internal.
ERROR: Build did NOT complete successfully
FAILED:

I saw that in line 19 of WORKSPACE there is a flag that seems to exclude tensorflow from the workspace, so I tried removing that. Unfortunately, it did not solve the issue and a similar error related to tensorflow popped up. I have admittedly little idea of how bazel works and would appreciate any help you could provide. Thank you!

extract model input features with dtypes

hi,

If I run TensorFlow's saved_model_cli, it shows 3 inputs (DT_INT64, DT_STRING, DT_FLOAT) with their corresponding feature names, and 1 output:

$ saved_model_cli show --dir ./ --tag_set serve --signature_def serving_default
The given SavedModel SignatureDef contains the following input(s):
  inputs['foo'] tensor_info:
      dtype: DT_INT64
      shape: (-1)
      name: serving_default_foo:0
  inputs['bar'] tensor_info:
      dtype: DT_STRING
      shape: (-1)
      name: serving_default_bar:0
  inputs['foobar'] tensor_info:
      dtype: DT_FLOAT
      shape: (-1)
      name: serving_default_foobar:0
The given SavedModel SignatureDef contains the following output(s):
  outputs['output_1'] tensor_info:
      dtype: DT_FLOAT
      shape: (-1, 1)
      name: StatefulPartitionedCall:0
Method name is: tensorflow/serving/predict

I'm trying to get the input feature names and data types as shown above. Is there a way to get them from the model using this API? I have only managed to read the input feature names so far:

package main

import (
	"fmt"
	"log"

	"github.com/google/yggdrasil-decision-forests/yggdrasil_decision_forests/port/go/model/io"
	rf "github.com/google/yggdrasil-decision-forests/yggdrasil_decision_forests/port/go/model/randomforest"
)

func main() {
	modelPath := "../foo/tf-model/assets"
	model, err := io.LoadModel(modelPath)
	if err != nil {
		log.Fatalf("Cannot load model. %v", err)
	}

	// rfModel was undefined in the original snippet; obtain it with a checked type assertion.
	rfModel, ok := model.(*rf.Model)
	if !ok {
		log.Fatalf("Expected a Random Forest model, got %T", model)
	}

	// Get the dataspec from the rfModel.
	dataspec := rfModel.Dataspec()

	// Print the index, name and semantic type of each column.
	for i, column := range dataspec.GetColumns() {
		fmt.Println(i, column.GetName(), column.GetType())
	}
}

Here is what I get. How can I map it to DT_INT64, DT_STRING and DT_FLOAT as shown above for saved_model_cli? Why does saved_model_cli show 3 dtypes while Yggdrasil only shows 2 types (numerical = float, categorical = string)?

0 foo NUMERICAL
1 bar CATEGORICAL
2 foobar NUMERICAL
3 __LABEL NUMERICAL
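The two listings describe different things: saved_model_cli prints the tensor dtype of each input in the serving signature, while the dataspec records the semantic of each column. The sketch below simply pairs the two outputs shown in this post by feature name and groups the dtypes by semantic; the dictionary is copied from the listings above, and nothing here is read from the model itself.

```python
# Pairing observed in this post: feature name -> (signature dtype, dataspec semantic).
OBSERVED = {
    "foo": ("DT_INT64", "NUMERICAL"),
    "bar": ("DT_STRING", "CATEGORICAL"),
    "foobar": ("DT_FLOAT", "NUMERICAL"),
}


def dtypes_by_semantic(observed):
    """Group the signature dtypes under the dataspec semantic they were mapped to."""
    grouped = {}
    for _, (dtype, semantic) in observed.items():
        grouped.setdefault(semantic, set()).add(dtype)
    return grouped


for semantic, dtypes in sorted(dtypes_by_semantic(OBSERVED).items()):
    print(semantic, sorted(dtypes))
```

In this example both DT_INT64 and DT_FLOAT end up as NUMERICAL, so the dataspec alone cannot tell the two apart; the original dtypes have to come from the SavedModel signature.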
