
Yggdrasil Decision Forests (YDF) is a production-grade collection of algorithms developed in Google Switzerland 🏔️ since 2018 for the training, serving, and interpretation of decision forest models. YDF is available in Python, C++, CLI, in TensorFlow under the name TensorFlow Decision Forests, JavaScript (inference only), and Go (inference only).

To learn more about YDF, see the documentation.

For more information on the design of YDF, see our paper at KDD 2023: Yggdrasil Decision Forests: A Fast and Extensible Decision Forests Library.

Key features

A simple API for training, evaluation, and serving of decision forest models.
Supports Random Forest, Gradient Boosted Trees, and CART, plus advanced learning algorithms such as oblique splits, honest trees, hessian and non-hessian scores, and global tree optimizations.
Train classification, regression, ranking, and uplifting models.
Fast model inference on CPU (microseconds per example per CPU core).
Supports distributed training over billions of examples.
Serving in Python, C++, TensorFlow Serving, Go, JavaScript, and CLI.
Rich report for model description (e.g., training logs, plot trees), analysis (e.g., variable importances, partial dependence plots, conditional dependence plots), evaluation (e.g., accuracy, AUC, ROC plots, RMSE, confidence intervals), tuning (trials configuration and scores), and cross-validation.
Natively consumes numerical, categorical, boolean, text, and missing values.
Backward compatibility for models and learners since 2018.
Consumes pandas DataFrames, NumPy arrays, TensorFlow Datasets, and CSV files.

Installation

To install YDF in Python from PyPI, run:

pip install ydf

Usage example

Example with the Python API.

import ydf
import pandas as pd

train_ds = pd.read_csv("adult_train.csv")
test_ds = pd.read_csv("adult_test.csv")

# Train a model
model = ydf.GradientBoostedTreesLearner(label="income").train(train_ds)

# Look at a model (input features, training logs, structure, etc.)
model.describe()

# Evaluate a model (e.g. ROC, accuracy, confusion matrix, confidence intervals)
model.evaluate(test_ds)

# Generate predictions
model.predict(test_ds)

# Analyse a model (e.g. partial dependence plot, variable importance)
model.analyze(test_ds)

# Benchmark the inference speed of a model
model.benchmark(test_ds)

# Save the model
model.save("/tmp/my_model")

Example with the C++ API.

auto dataset_path = "csv:train.csv";

// List columns in training dataset
DataSpecification spec;
CreateDataSpec(dataset_path, false, {}, &spec);

// Create a training configuration
TrainingConfig train_config;
train_config.set_learner("RANDOM_FOREST");
train_config.set_task(Task::CLASSIFICATION);
train_config.set_label("my_label");

// Train model
std::unique_ptr<AbstractLearner> learner;
GetLearner(train_config, &learner);
auto model = learner->Train(dataset_path, spec);

// Export model
SaveModel("my_model", model.get());

(based on examples/beginner.cc)

The same model can be trained in Python using TensorFlow Decision Forests as follows:

import tensorflow_decision_forests as tfdf
import pandas as pd

# Load dataset in a Pandas dataframe.
train_df = pd.read_csv("project/train.csv")

# Convert dataset into a TensorFlow dataset.
train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(train_df, label="my_label")

# Train model
model = tfdf.keras.RandomForestModel()
model.fit(train_ds)

# Export model.
model.save("project/model")

Next steps

Check the Getting Started tutorial 🧭.

Google I/O Presentation

Yggdrasil Decision Forests powers TensorFlow Decision Forests.

Citation

If you use Yggdrasil Decision Forests in a scientific publication, please cite the following paper: Yggdrasil Decision Forests: A Fast and Extensible Decision Forests Library.

Bibtex

@inproceedings{GBBSP23,
  author       = {Mathieu Guillame{-}Bert and
                  Sebastian Bruch and
                  Richard Stotz and
                  Jan Pfeifer},
  title        = {Yggdrasil Decision Forests: {A} Fast and Extensible Decision Forests
                  Library},
  booktitle    = {Proceedings of the 29th {ACM} {SIGKDD} Conference on Knowledge Discovery
                  and Data Mining, {KDD} 2023, Long Beach, CA, USA, August 6-10, 2023},
  pages        = {4068--4077},
  year         = {2023},
  url          = {https://doi.org/10.1145/3580305.3599933},
  doi          = {10.1145/3580305.3599933},
}

Raw

Yggdrasil Decision Forests: A Fast and Extensible Decision Forests Library, Guillame-Bert et al., KDD 2023: 4068-4077. doi:10.1145/3580305.3599933

Contact

You can contact the core development team at [email protected].

Credits

Yggdrasil Decision Forests and TensorFlow Decision Forests are developed by:

Mathieu Guillame-Bert (gbm AT google DOT com)
Jan Pfeifer (janpf AT google DOT com)
Sebastian Bruch (sebastian AT bruch DOT io)
Richard Stotz (richardstotz AT google DOT com)
Arvind Srinivasan (arvnd AT google DOT com)

Contributing

Contributions to TensorFlow Decision Forests and Yggdrasil Decision Forests are welcome. If you want to contribute, check the contribution guidelines.

License

Apache License 2.0

Index	Feature_a	Feature_b	Label
0	0	1	0
1	0	0	0
2	1	1	1
3	0	1	0

Index	Feature_a	Feature_b	Label
0	0	1	0
1	0	0	0
2	0	1	0

Index	Feature_a	Feature_b	Label
0	0	1	0
1	0	0	0
2	0	1	0

Model/Dataset	Breast Cancer	Digits
XGBoost	0.1072	0.1606
Yggdrasil	0.0944	0.2350

	#### [use_hessian_gain](../yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.proto?q=symbol:use_hessian_gain)

	- Type: Categorical. Default: false. Possible values: true, false.

	- If true, uses a formulation of split gain with a hessian term, i.e.
	optimizes the splits to minimize the variance of "gradient / hessian".
	Available for all losses except regression.
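To make the "hessian term" concrete, here is a minimal sketch of the second-order split gain commonly used in gradient boosting (the XGBoost-style structure score), next to a plain variance-reduction gain on the gradients. It illustrates the general idea only and is not YDF's internal code: the function names `variance_gain` and `hessian_gain` and the `l2` parameter are assumptions made for this example.

```python
from typing import Sequence


def variance_gain(grad: Sequence[float], left: Sequence[bool]) -> float:
    """First-order gain: drop in the sum of squared deviations of the gradients."""

    def sse(values):
        if not values:
            return 0.0
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values)

    left_vals = [g for g, m in zip(grad, left) if m]
    right_vals = [g for g, m in zip(grad, left) if not m]
    return sse(list(grad)) - sse(left_vals) - sse(right_vals)


def hessian_gain(
    grad: Sequence[float],
    hess: Sequence[float],
    left: Sequence[bool],
    l2: float = 0.0,  # hypothetical L2 regularization term
) -> float:
    """Second-order gain: G_L^2/(H_L+l2) + G_R^2/(H_R+l2) - G^2/(H+l2)."""

    def score(g, h):
        return g * g / (h + l2) if (h + l2) > 0 else 0.0

    g_left = sum(g for g, m in zip(grad, left) if m)
    h_left = sum(h for h, m in zip(hess, left) if m)
    g_all, h_all = sum(grad), sum(hess)
    return (
        score(g_left, h_left)
        + score(g_all - g_left, h_all - h_left)
        - score(g_all, h_all)
    )


grads = [-0.5, -0.5, 0.5, 0.5]
goes_left = [True, True, False, False]

# With unit hessians both formulations score the split identically.
gain_first_order = variance_gain(grads, goes_left)
gain_unit_hessian = hessian_gain(grads, [1.0] * 4, goes_left)

# With logistic-style hessians p * (1 - p) = 0.25, the same split scores higher.
gain_small_hessian = hessian_gain(grads, [0.25] * 4, goes_left)
```

With unit hessians the two gains coincide, which is why the distinction only matters for losses whose second derivative varies across examples.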

	if (config.gbt_config->use_hessian_gain() &&
	    gradients.front().hessian_col_idx == -1) {
	  return absl::InvalidArgumentError(
	      "Loss does not support hessian optimization");
	}

	if (loss_shape.has_hessian) {
	  const auto hessian_col_name = HessianColumnName(gradient_idx);
	  dataset::proto::Column hessian_col_spec;
	  hessian_col_spec.set_name(hessian_col_name);
	  hessian_col_spec.set_type(dataset::proto::ColumnType::NUMERICAL);

	  // Note: These values will be set correctly before use.
	  gradient.hessian =
	      dynamic_cast<dataset::VerticalDataset::NumericalColumn*>(
	          gradient_dataset->AddColumn(hessian_col_spec).value())
	          ->mutable_values();
	  gradient.hessian_col_idx =
	      gradient_dataset->ColumnNameToColumnIdx(hessian_col_name);

	class MeanSquaredErrorLoss : public AbstractLoss {
	 public:
	  MeanSquaredErrorLoss(
	      const proto::GradientBoostedTreesTrainingConfig& gbt_config,
	      model::proto::Task task, const dataset::proto::Column& label_column);

	  absl::Status Status() const override;

	  LossShape Shape() const override {
	    return LossShape{/*.gradient_dim =*/1, /*.prediction_dim =*/1,
	                     /*.has_hessian =*/false};
	  };

	class BinomialLogLikelihoodLoss : public AbstractLoss {
	 public:
	  BinomialLogLikelihoodLoss(
	      const proto::GradientBoostedTreesTrainingConfig& gbt_config,
	      model::proto::Task task, const dataset::proto::Column& label_column);

	  absl::Status Status() const override;

	  LossShape Shape() const override {
	    return LossShape{/*.gradient_dim =*/1,
	                     /*.prediction_dim =*/1,
	                     /*.has_hessian =*/gbt_config_.use_hessian_gain()};
	  };

	case model::proto::Task::CLASSIFICATION: {
	  if (internal_config.use_hessian_gain) {
	    return absl::InternalError("Expect use_hessian_gain=false");
	  }

	case model::proto::Task::REGRESSION: {
	  if (internal_config.use_hessian_gain) {
	    RegressionHessianLabelStats label_stat(
	        train_dataset

	if (gbt_config_.use_hessian_gain()) {
	  auto* reg = node->mutable_node()->mutable_regressor();
	  reg->set_sum_gradients(numerator);
	  reg->set_sum_hessians(denominator);
	  reg->set_sum_weights(sum_weights);

	absl::Status MeanSquaredErrorLoss::Status() const {
	  if (task_ != model::proto::Task::REGRESSION &&
	      task_ != model::proto::Task::RANKING) {
	    return absl::InvalidArgumentError(
	        "Mean squared error loss is only compatible with a "
	        "regression or ranking task");
	  }

	absl::Status BinomialLogLikelihoodLoss::Status() const {
	  if (task_ != model::proto::Task::CLASSIFICATION)
	    return absl::InvalidArgumentError(
	        "Binomial log likelihood loss is only compatible with a "
	        "classification task");

	case model::proto::Task::REGRESSION: {
	  if (internal_config.use_hessian_gain) {
	    RegressionHessianLabelStats label_stat(
	        train_dataset
	            .ColumnWithCast<dataset::VerticalDataset::NumericalColumn>(
	                config_link.label())
	            ->values(),
	        train_dataset
	            .ColumnWithCast<dataset::VerticalDataset::NumericalColumn>(
	                internal_config.hessian_col_idx)
	            ->values());

	label_stat.sum_gradient = parent.regressor().sum_gradients();
	label_stat.sum_hessian = parent.regressor().sum_hessians();
	label_stat.sum_weights = parent.regressor().sum_weights();



S: use_hessian_gain "Available for all losses except regression."

A: Hessian available for LogLikelihood, not MSE

A.1: Error when use_hessian_gain = true and hessian_col_idx is not set

A.2: gradient.hessian_col_idx is set iff has_hessian = true

A.3: has_hessian = false for MSE

A.4: has_hessian = use_hessian_gain for LogLikelihood loss

B: Hessian available for Regression, not Classification

B.1: Error when use_hessian_gain is set on Task::CLASSIFICATION

B.2: No error when use_hessian_gain is set on Task::REGRESSION

C: Hessian only trainable on Regression task with LogLikelihood loss

C.1: G, H, W (sums of gradients, hessians, and weights) available for LogLikelihood loss, nothing else

C.2: G, H, W only accessed on Task::REGRESSION, nothing else

D: MSE = Regression + Ranking, LogLikelihood = Classification

D.1: MSE only available for Regression or Ranking, not Classification

D.2: LogLikelihood loss only available for Classification
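Premises A, B, and D can be combined mechanically. The sketch below is not YDF code: it encodes each premise exactly as the outline states it, then lists which (task, loss) pairs would pass every check. The string constants and function names are hypothetical. It also assumes that the `Task` checked in B is the same user-facing task checked in D, which the quoted snippets alone do not establish.

```python
from itertools import product

TASKS = ("CLASSIFICATION", "REGRESSION", "RANKING")
LOSSES = ("MEAN_SQUARED_ERROR", "BINOMIAL_LOG_LIKELIHOOD")


def loss_matches_task(loss: str, task: str) -> bool:
    """Premises D.1 and D.2: which task each loss accepts."""
    if loss == "MEAN_SQUARED_ERROR":
        return task in ("REGRESSION", "RANKING")
    return task == "CLASSIFICATION"


def loss_has_hessian(loss: str, use_hessian_gain: bool) -> bool:
    """Premises A.3 and A.4: only the log-likelihood loss exposes a hessian."""
    return loss == "BINOMIAL_LOG_LIKELIHOOD" and use_hessian_gain


def first_violation(task: str, loss: str, use_hessian_gain: bool):
    """Returns the label of the first premise that rejects the setup, or None."""
    if not loss_matches_task(loss, task):
        return "D"
    if use_hessian_gain and not loss_has_hessian(loss, use_hessian_gain):
        return "A.1"  # hessian_col_idx would stay at -1
    if use_hessian_gain and task == "CLASSIFICATION":
        return "B.1"
    return None


def accepted_pairs(use_hessian_gain: bool):
    return [
        (task, loss)
        for task, loss in product(TASKS, LOSSES)
        if first_violation(task, loss, use_hessian_gain) is None
    ]


with_hessian = accepted_pairs(use_hessian_gain=True)
without_hessian = accepted_pairs(use_hessian_gain=False)
```

Under this literal reading, `with_hessian` is empty while `without_hessian` contains three pairs. The result depends entirely on the assumption about which task B refers to.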

Conclusion drawn from S, A, B, C, D:
