mgbdt's Introduction

Multi-Layered Gradient Boosting Decision Trees

This is the official clone for the implementation of mGBDT.

Package Official Website: http://lamda.nju.edu.cn/code_mGBDT.ashx

This package is provided "AS IS" and is free for academic usage. Use it at your own risk. For other purposes, please contact Prof. Zhi-Hua Zhou ([email protected]).

Description: A Python implementation of mGBDT as proposed in [1], including the mGBDT library and demo client scripts that demonstrate how to use the code. The implementation is flexible enough to modify the model or fit your own datasets.

Reference: [1] J. Feng, Y. Yu, and Z.-H. Zhou. Multi-Layered Gradient Boosting Decision Trees. In: Advances in Neural Information Processing Systems 31 (NIPS'18), Montreal, Canada, 2018.

ATTN: This package was developed and is maintained by Mr. Ji Feng (http://lamda.nju.edu.cn/fengj/). For any problem concerning the code, please feel free to contact Mr. Feng ([email protected]) or open an issue here.

Environments

  • The code was developed under Python 3.5, so first create a Python 3.5 environment using Anaconda:
conda create -n mgbdt python=3.5 anaconda
  • Install the dependent packages:
source activate mgbdt
conda install pytorch=0.1.12 cuda80 -c pytorch
pip install -r requirements.txt
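
As a quick sanity check (this snippet is not part of the repository), you can verify that the pinned dependencies import correctly from inside the activated environment:

import torch
import xgboost

# the demo below assumes the pinned versions; expect torch 0.1.12 here
print("torch:", torch.__version__)
print("xgboost:", xgboost.__version__)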

Demo Code

from sklearn import datasets
from sklearn.model_selection import train_test_split

# To use the mgbdt library, you have to add the library directory to your Python path.
# If you are in this repository's root directory, you can do it with the following lines
import sys
sys.path.insert(0, "lib")

from mgbdt import MGBDT, MultiXGBModel

# make a synthetic circle dataset using sklearn
n_samples = 15000
x_all, y_all = datasets.make_circles(n_samples=n_samples, factor=.5, noise=.04, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(x_all, y_all, test_size=0.3, random_state=0, stratify=y_all)

# Create a multi-layered GBDT
net = MGBDT(loss="CrossEntropyLoss", target_lr=1.0, epsilon=0.1)

# add several target-propagation layers
# F and G represent the forward mapping and the inverse mapping (in the paper, gradient boosted decision trees are used for both)
net.add_layer("tp_layer",
    F=MultiXGBModel(input_size=2, output_size=5, learning_rate=0.1, max_depth=5, num_boost_round=5),
    G=None)
net.add_layer("tp_layer",
    F=MultiXGBModel(input_size=5, output_size=3, learning_rate=0.1, max_depth=5, num_boost_round=5),
    G=MultiXGBModel(input_size=3, output_size=5, learning_rate=0.1, max_depth=5, num_boost_round=5))
net.add_layer("tp_layer",
    F=MultiXGBModel(input_size=3, output_size=2, learning_rate=0.1, max_depth=5, num_boost_round=5),
    G=MultiXGBModel(input_size=2, output_size=3, learning_rate=0.1, max_depth=5, num_boost_round=5))

# init the forward mapping
net.init(x_train, n_rounds=5)

# fit the dataset
net.fit(x_train, y_train, n_epochs=50, eval_sets=[(x_test, y_test)], eval_metric="accuracy")

# prediction
y_pred = net.forward(x_test)

# get the hidden outputs
# hiddens[0] represents the input data
# hiddens[1] represents the output of the first layer
# hiddens[2] represents the output of the second layer
# hiddens[3] represents the output of the final layer (the same as y_pred)
hiddens = net.get_hiddens(x_test)
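
As a quick check of the demo (this evaluation is not part of the original script; it assumes net.forward returns one score per class, matching output_size=2 and the CrossEntropyLoss above), the test accuracy can be computed with scikit-learn:

import numpy as np
from sklearn.metrics import accuracy_score

# y_pred holds the final layer's output; with one column per class,
# the predicted label is the argmax over columns (an assumption)
y_pred_label = np.argmax(y_pred, axis=1)
print("test accuracy:", accuracy_score(y_test, y_pred_label))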

Experiments

circle dataset

By running the following script:

  • It will train a multi-layered GBDT with structure (input - 5 - 3 - output) on the synthetic circle dataset
  • The visualization of the input (which is 2D) will be saved in outputs/circle/input.jpg (as shown below)
  • The visualization of the second layer's output (which is 3D) will be saved in outputs/circle/pred2.jpg (as shown below)
python exp/circle.py
(Figures: Input / Transformed)

scurve dataset

By running the following script:

  • It will train an autoencoder using a multi-layered GBDT with structure (input - 5 - output) on the synthetic scurve dataset
  • The visualization of the input (which is 3D) will be saved in outputs/scurve/input.jpg (as shown below)
  • The visualization of the reconstructed result (which is 3D) will be saved in outputs/scurve/pred2.jpg (as shown below)
python exp/scurve.py
(Figures: Input / Reconstructed)
  • The visualization of the encodings will also be saved; since the 5D encodings are impossible to visualize directly, we visualize every pair of the 5 dimensions in 2D space (a plotting sketch follows the figures below)
  • The visualization of the $i'th and $j'th dimensions will be saved in outputs/scurve/pred1.$i_$j.jpg (as shown below)
(Figures: dimensions 1 and 2, dimensions 1 and 5)
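
For reference, these pairwise plots could be produced with a short matplotlib sketch along the following lines (this is not the repository's plotting code; it assumes a fitted autoencoder net and test inputs x_test as in exp/scurve.py, and that hiddens[1] holds the 5D encoding):

import itertools
import matplotlib.pyplot as plt

encoding = net.get_hiddens(x_test)[1]  # assumed: the 5D encoding layer
for i, j in itertools.combinations(range(encoding.shape[1]), 2):
    plt.figure()
    plt.scatter(encoding[:, i], encoding[:, j], s=2)
    plt.xlabel("dimension %d" % (i + 1))
    plt.ylabel("dimension %d" % (j + 1))
    plt.savefig("outputs/scurve/pred1.%d_%d.jpg" % (i + 1, j + 1))
    plt.close()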

uci_adult dataset

At first, you need to download the dataset by running the following commands:

cd dataset/uci_adult
sh get_data.sh

Then, by running the following script:

  • It will train a multi-layered GBDT with structure (input - 128 - 128 - output)
  • The accuracy will be logged for each epoch
python exp/uci_adult.py

uci_yeast dataset

By running the following script:

  • It will train a multi-layered GBDT with structure (input - 16 - 16 - output)
  • 10-fold cross-validation is used
  • The accuracy will be logged for each epoch and each fold
python exp/uci_yeast.py

Happy hacking.


mgbdt's Issues

Performance of your model on regression tasks

Description

@kingfengji Thanks for making the code available. I believe that multi-layered gradient boosting decision trees are a very elegant and powerful approach! I was applying your model to the Boston housing dataset but wasn't able to outperform a baseline xgboost model.

Details

To compare your approach to several alternatives, I ran a small benchmark study using the following approaches, where all models share the same hyper-parameters:

  • baseline xgboost model (xgboost)
  • mGBDT with xgboost for hidden and output layer (mGBDT_XGBoost)
  • mGBDT with xgboost for hidden but with linear model for output layer (mGBDT_Linear)
  • linear model as implemented here (Linear)

I am using PyTorch's L1Loss for model training and the MAE for evaluation; all models are trained in serial mode. The results are as follows:

(Figure: benchmark results)

In particular, I observe the following:

  • irrespective of the hyper-parameters and the number of epochs, a baseline xgboost model tends to outperform your approach
  • with an increasing number of epochs, the runtime per epoch increases considerably. Any idea as to why this happens?
  • using mGBDT_Linear,
    • I wasn't able to use PyTorch's MSELoss since the loss exploded after some iterations, even after normalizing X. Should we, similar to neural networks, also scale y to avoid exploding gradients? (see the sketch after this list)
    • the training loss starts at exceptionally high values, then decreases before it starts to increase again
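
For concreteness, the target scaling asked about above could look like the following sketch (it uses scikit-learn's StandardScaler and the net.forward API from the README demo; none of this is code from the mGBDT repository):

from sklearn.preprocessing import StandardScaler

# standardize y to zero mean / unit variance before training,
# mirroring the common practice for neural networks (an assumption)
y_scaler = StandardScaler()
y_train_scaled = y_scaler.fit_transform(y_train.reshape(-1, 1)).ravel()
# ... train on y_train_scaled, then map the predictions back:
y_pred = y_scaler.inverse_transform(net.forward(x_test).reshape(-1, 1)).ravel()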

Additional Questions

  • Given that you have mostly been using your approach for classification tasks, is there anything we need to change before using it for regression tasks, apart from the PyTorch loss?
  • Besides the loss of F, can we also track how well the target propagation is working by evaluating the reconstruction loss of G?
  • When using mGBDT with a linear output layer, would we expect to generally see better results compared to using xgboost for the output layer?
  • What is the benefit of using a linear output layer compared to an xgboost layer?
  • For training F and G, you are currently using the MSELoss for the xgboost models. Do you have some experience with modifying this loss?
  • What is the effect of the number of iterations for initializing the model before training?
  • What is the relationship between the number of boosting iterations (for xgboost training) and the number of epochs (for MGBDT training)?
  • In Section 4 of your paper you state: "The experiments for this section is mainly designed to empirically examine if it is feasible to jointly train the multi-layered structure proposed by this work. That is, we make no claims that the current structure can outperform CNNs in computer vision tasks." So, as a question: would that mean that your intention is not to outperform existing deep learning based models, say CNNs, or existing GBM models, like XGBoost, but rather to show that a decision tree based model can also be used for learning meaningful representations that can then be used for downstream tasks?
  • Connected to the previous question: gradient boosting models are already very strong learners that obtain very good results in many applications. So what would be your motivation for using multiple layers of such a model? Might it even happen that, given the implicit error-correction mechanism of GBMs, training several of them leads to a drop in accuracy?

Code

To reproduce the results, you can use the attached notebook.

ModelComparison.zip

@kingfengji I would highly appreciate your feedback. Many thanks.

Problem with Pop

/opt/conda/lib/python3.7/site-packages/joblib/parallel.py in <listcomp>(.0)
    254         with parallel_backend(self._backend, n_jobs=self._n_jobs):
    255             return [func(*args, **kwargs)
--> 256                     for func, args, kwargs in self.items]
    257
    258     def __len__(self):

/kaggle/working/mGBDT/lib/mgbdt/model/online_xgb.py in fit_increment(self, X, y, num_boost_round, params)
     13         for k, v in extra_params.items():
     14             params[k] = v
---> 15         params.pop("n_estimators")
     16
     17         if callable(self.objective):

KeyError: 'n_estimators'
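
A likely workaround (a sketch, not an official patch) is to make the pop call tolerant of a missing key:

# in lib/mgbdt/model/online_xgb.py, fit_increment():
# drop "n_estimators" only if it is present, so a missing key no longer raises
params.pop("n_estimators", None)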

Environment

I feel that code written in Python 3.5 would likely be compatible with other Python 3 versions; are you sure that a build under Python 3.5 is necessary?

Can not find the uci dataset

Hi,
I want to run the uci_yeast and uci_adult demos, but I can't find the get_data.sh files that the README mentions. Would you please upload them, or tell me the data format so I can handle it myself?
I also find that the code uses a features file, but it is not in the repository either.
